Data Oriented Picture Parsing: A hybrid model

Dave Cochran
Linguistics and English Language
School of Philosophy, Psychology and Language Sciences
University of Edinburgh

The following paper is an outline proposal for an algorithm for visual parsing which hybridises Data-Oriented Parsing (see Bod 1998 for an overview) with connectionist networks. Bod (2005) notes that trees are structures which connect higher-level cognitive representations to lower-level cognitive representations. It is therefore necessary to propose other architectures to handle the lowest level of feature recognition, and it is here that I wish to integrate neural nets into my model of Data-Oriented Vision. This model gains in robustness by calculating probabilities that aggregate the outputs of the feature-recognition nets (interpreted as probabilities) with the probabilities of Stochastic Tree-Substitution derivations.

0. Introduction

The following paper is an outline proposal for an algorithm for visual parsing which hybridises Data-Oriented Parsing (see Bod 1998 for an overview) with connectionist networks. Bod (2005) notes that trees are structures which connect higher-level cognitive representations to lower-level cognitive representations. It is therefore necessary to propose other architectures to handle the lowest level of feature recognition, and it is here that I wish to integrate neural nets into my model of Data-Oriented Vision. This model gains in robustness by calculating probabilities that aggregate the outputs of the feature-recognition nets (interpreted as probabilities) with the probabilities of Stochastic Tree-Substitution derivations.
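As a first approximation, the aggregation just described might be sketched as follows. This is a minimal illustration, not a fixed detail of the model: the function names, the example numbers, and the choice of a simple product as the aggregation operation are all my own assumptions.

```python
from math import prod

def derivation_probability(subtree_probs):
    """DOP/STSG: the probability of a derivation is the product of the
    probabilities of the subtrees substituted to construct it."""
    return prod(subtree_probs)

def hybrid_parse_score(subtree_probs, leaf_net_probs):
    """Hybrid score: aggregate the Stochastic Tree-Substitution
    derivation probability with the feature-recognition nets'
    outputs (treated as probabilities) at the leaves of the tree.
    A simple product is assumed here purely for illustration."""
    return derivation_probability(subtree_probs) * prod(leaf_net_probs)

# Hypothetical numbers for illustration only:
# derivation probability 0.5 * 0.4 = 0.2, net outputs 0.9 and 0.8.
score = hybrid_parse_score([0.5, 0.4], [0.9, 0.8])
```

Under this scheme a parse is penalised both when its derivation is improbable relative to the exemplar base and when the nets are unconfident about the low-level features it posits; other aggregation operations would be possible and remain to be tested.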
My purpose in wishing to implement this algorithm is to provide a basis for grounded semantic representations for Data-Oriented Generation: more specifically, I wish to use paired corpora of images and their descriptions as the exemplar base for a Data-Oriented Generator which, I hope, will be able to produce grammatical, true descriptive sentences when presented with novel visual input.

Before proceeding, one caveat must be noted. This paper outlines an algorithm which has yet to be implemented. As such, some details of the implementation must be left unspecified, simply because I do not yet have a clear idea of how they should be specified, and probably will not until I have had the chance to try a few things out and see what works.

1. Training Data

For the first implementation of this model, I do not wish to overtax it by using complicated visual scenes, for which human viewers' analysis would be modulated by imagining the static two-dimensional image as a representation of a dynamic, three-dimensional event, or even by imagining parts of the picture to represent intentional agents. However, if the images are to be paired with natural-language descriptions, abstract images will probably not elicit sufficiently rich and precise descriptions. I therefore propose to use images comprising Roman letters and characters from other writing systems, in black print on a white background; for an example, see figure 1. Using letters has a number of advantages: