Syntactic Image Parsing using Ontology and Semantic Descriptions

Ifeoma Nwogu, University of Rochester, Rochester, NY, inwogu@cs.rochester.edu
Venu Govindaraju, University at Buffalo, SUNY, Buffalo, NY, govind@buffalo.edu
Chris Brown, University of Rochester, Rochester, NY, brown@cs.rochester.edu

Abstract

We present an ontology-guided, symbol-based image parser which uses semantic, spoken-language descriptions of entities in images as well as the real-world spatial relationships defined between these entities. Our parsing approach explicitly describes objects and the relationships between them with linguistically meaningful modes of colors, textures and [coarse] expressions of shapes.

The image parser is built on a syntactic image grammar-based framework and performs a (near) global optimization using superpixels as an initial set of subpatterns. It hypothesizes the entities in images using their local semantic attributes and verifies them globally using their more global features and their relative spatial locations. Evaluations of the parser are performed on selected images, which we make publicly available along with their manual segmentations and our labeling results.

1. Introduction

In our image parsing approach, given a 2D image of the real world, we first decide whether the image is in the class of outdoor or indoor images. Within these classes exist a set of subdomains (or locations) as described by Zhu et al. [26]. Some examples of subdomains in their indoor class include bathroom, hall, office, etc., while examples in the outdoor class include street, highway, cityview, seashore, etc. Numerous approaches have been presented in the literature that accomplish this task of scene categorization quite successfully [11, 13, 24]. We therefore assume a given subdomain and explicitly build an ontology to define the entities we expect to find in it.
Within the ontology, the part relationships (or mereology) of the entities are defined; for example, the sky is made up of both clouds and clear sky. The relationships between the entities in the vocabulary are defined in the ontology in the form of a syntactic tree, which specifies how pairs of entities can relate to each other (topological relationships). Finally, the ontology also describes any constraints imposed to ensure that only certain syntax relations are allowed. For example, if the sea and sky touch (in 2D images), due to perspective geometry, the border between them is a (near) straight line of zero slope. Such syntax relations make up the grammatical rules for parsing. Hence, we capture and formally record human-based knowledge about entities and their parts, their attributes and their relationships using an ontology.

The summary of the proposed approach is as follows: given any input image (assumed to be in one of the pre-established subdomains), the image is over-segmented to yield a set of "superpixels", each a locally coherent grouping of image pixels that preserves the structure necessary for image parsing. Every entity in the image is therefore made up of one or more superpixels or subpatterns. A graph is defined over the superpixels and a Markov random field (MRF) is imposed on the resulting irregular graph. Low-level semantic features are extracted from the superpixels and these represent the pattern primitives that will initiate the parsing process. Grammar production rules in the form of ontological rules are used for syntactically reconstructing the scene from the graph of superpixels. Each superpixel stochastically takes on different entity labels, and the sequence of labels that yields the best explanation for the image scene, according to the syntax relations, represents the final labels of the parsed image.
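To make the role of the syntax relations concrete, the following is a minimal sketch, not the parser described in this paper. It scores label assignments over a toy superpixel graph by combining local evidence with pairwise ontology terms. The entity names, the allowed-adjacency set, the slope tolerance, the penalty value and the per-superpixel likelihoods are all illustrative assumptions. Exhaustive enumeration stands in for the stochastic (near) global optimization, and is feasible only for very small graphs.

```python
import math
from itertools import product

# Hypothetical vocabulary for a "seashore" subdomain (illustrative only).
LABELS = ("sky", "sea", "sand")

# Assumed syntax relations: unordered pairs of different entities that may touch.
ALLOWED_TOUCH = {frozenset(("sky", "sea")), frozenset(("sea", "sand"))}

MAX_HORIZON_SLOPE = 0.05  # assumed tolerance for a "(near) zero slope" border
PENALTY = 10.0            # assumed cost for breaking a syntax relation


def pair_cost(label_a, label_b, border_slope):
    """Cost of two neighbouring superpixels taking the given labels."""
    if label_a == label_b:
        return 0.0
    pair = frozenset((label_a, label_b))
    if pair not in ALLOWED_TOUCH:
        return PENALTY
    # Constraint from the text: a sea/sky border is a near-horizontal line.
    if pair == frozenset(("sky", "sea")) and abs(border_slope) > MAX_HORIZON_SLOPE:
        return PENALTY
    return 0.0


def energy(labeling, unary, edges):
    """Total cost: negative log local likelihoods plus pairwise syntax terms."""
    local = sum(-math.log(unary[i][lab]) for i, lab in enumerate(labeling))
    pairwise = sum(pair_cost(labeling[i], labeling[j], slope)
                   for i, j, slope in edges)
    return local + pairwise


def best_labeling(unary, edges):
    """Exhaustive search over all labelings; a stand-in for stochastic search."""
    candidates = product(LABELS, repeat=len(unary))
    return min(candidates, key=lambda labeling: energy(labeling, unary, edges))


# Toy graph: three superpixels stacked top to bottom (made-up likelihoods).
unary = [
    {"sky": 0.7, "sea": 0.2, "sand": 0.1},  # superpixel 0: top
    {"sky": 0.3, "sea": 0.3, "sand": 0.4},  # superpixel 1: middle, ambiguous
    {"sky": 0.1, "sea": 0.2, "sand": 0.7},  # superpixel 2: bottom
]
# Each edge is (superpixel i, superpixel j, slope of their shared border).
edges = [(0, 1, 0.0), (1, 2, 0.3)]

print(best_labeling(unary, edges))
```

In this toy case the middle superpixel would be labeled sand on local evidence alone, but sky and sand are not allowed to touch under the assumed relations, so the lowest-cost global explanation labels it sea instead.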
We summarize our contributions as follows:

(i) where many approaches perform entity and scene categorization in an ad hoc manner (categorizing unrelated entities such as tigers, computer screens, grand pianos, trees, snow, and cars in one algorithm), we present a formal approach to modeling real-world entities and their relationships, using the ontology of subdomains. The advantage here is that our model can be extended universally to eventually encompass most real-world entities;

(ii) by implementing our parser formally, we pre-define a vocabulary of entities, their syntax relations and the rules of production. Hence, given any unconstrained

978-1-4244-7030-3/10/$26.00 ©2010 IEEE