Conceptual description of visual scenes from linguistic models

A. Mukerjee a,*, K. Gupta b, S. Nautiyal b, M.P. Singh c, N. Mishra d

a Center for Robotics, Indian Institute of Technology, Kanpur, India
b Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
c MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
d Hindustan Lever Ltd, Bombay, India

Received 18 September 1997; received in revised form 19 December 1997; accepted 13 July 1999

Abstract

As model-based vision moves towards handling imprecise descriptions like "a long bench is in front of the tree", it has to confront questions involving widely variable shapes in unclear positions. Such descriptions may be said to be "conceptual" in the sense that they provide a loose set of constraints permitting a range of instantiations for the scene. One of the validations of a computational system's ability to handle such descriptions is provided by immediate visualization, which tells the user whether the bench is of the right shape and has been positioned correctly. Such a visualization must handle impreciseness in Shape and Spatial Pose, and, for dynamic vision, Object Articulation and Motion Parameters as well. The visualization task is a concretization which consists of generating an "instance" of the scene/action being described. The principal requirement for concretizing the conceptual model is a large visual database of objects and actions, along with a set of constraints corresponding to default dependencies in the domain. In our work, the resulting set of constraints is combined using multi-dimensional fuzzy functions called continuum fields (potentials). A set of experiments was conducted to determine the parameters of these continuum fields. An instance is generated by identifying minima in the continuum fields involved in generating the shape, position and motion.
These are then used to create default instantiations of the objects described. The resulting image/animation may be considered to be the "most likely" visualization, and if this matches the linguistic description, the continuum fields selected are a good model for the conceptual content in the linguistic model of the scene. We present examples of scene reconstruction from conceptual descriptions of urban parks.

© 2000 Elsevier Science B.V. All rights reserved.

Keywords: Visual scenes; Continuum fields; Linguistic models

1. Introduction

Consider a visual surveillance task, where the supervisor would like to provide a description for "suspicious" behavior in the aisles. One would be tempted to label the behavior suspicious "If the person approaches different aisle locations, hesitates, looks left and right, and then quickly picks up an object, etc." Such a model describes the behavior at a sufficiently abstract level, and provides a simple, effective means of constraining the visual search process to certain aspects of the dynamic scene. Yet, current model-based vision techniques have no mechanism for handling such input. Even without the linguistic aspects, it is clear that models need to be constructed for dynamic motions such as "approach", "hesitate", "look left and right", etc. Interpreting such actions depends on being able to identify the spatial parameters of the action, which may involve impreciseness in several geometry and motion parameters. Traditional techniques of model-based vision use geometric models originally designed for CAD applications [1,2]. Unfortunately, the CAD models were intended to meet the requirement that the model be "unambiguous" [3]—an important attribute when it comes to manufacturing or visualizing the part. Further, many conceptualizations of visual scenes incorporate motion with widely variable interpretations, e.g. The man goes to the woman and gives the flower to the woman.
Such "conceptual descriptions" are much better expressed in linguistic terms (as in this example) than as a set of joint motion histories of articulated objects with precise geometries. In particular, the fine definition required of a CAD-style geometric model fails to handle the shape and position variation, and the animation histories preserved as joint

Image and Vision Computing 18 (2000) 173–187

* Corresponding author.
E-mail addresses: amit@iitk.ernet.in (A. Mukerjee); mukes@media.mit.edu (M.P. Singh).
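The concretization step described in the abstract — combining fuzzy constraints into a continuum field and picking its minimum as the "most likely" instantiation — can be illustrated with a minimal one-dimensional sketch. This is an assumption-laden toy, not the paper's implementation: the Gaussian-shaped potentials, the additive combination, and the grid search are all illustrative choices, whereas the actual continuum fields are multi-dimensional and empirically parameterized.

```python
import numpy as np

def gaussian_potential(center, width):
    """A fuzzy constraint: the field is lowest at the preferred value `center`
    and rises smoothly away from it (illustrative Gaussian shape)."""
    return lambda x: 1.0 - np.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def combine(potentials):
    """Combine several constraints by summing their continuum fields."""
    return lambda x: sum(p(x) for p in potentials)

def concretize(field, lo, hi, steps=1001):
    """Generate an 'instance': the grid point where the combined field
    is minimal, i.e. the most likely placement under the constraints."""
    xs = np.linspace(lo, hi, steps)
    return xs[np.argmin(field(xs))]

# Hypothetical example: one constraint prefers a position near 2.0
# ("in front of the tree"), another prefers 3.0 ("near the path").
# The combined minimum is the compromise placement.
field = combine([gaussian_potential(2.0, 0.5), gaussian_potential(3.0, 0.5)])
x_best = concretize(field, 0.0, 5.0)  # → 2.5
```

With equally weighted constraints the minimum falls midway between the two preferred positions; weighting the potentials differently would bias the instantiation toward the stronger constraint, mirroring how default dependencies in the domain shape the continuum fields.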