Combined Structure and Motion Extraction from Visual Data Using Evolutionary Active Learning

Krishnanand N. Kaipa
Department of Computer Science, University of Vermont, Burlington, VT, USA
kkrishna@uvm.edu

Josh C. Bongard
Department of Computer Science, University of Vermont, Burlington, VT, USA
jbongard@uvm.edu

Andrew N. Meltzoff
Institute for Learning and Brain Sciences, University of Washington, Seattle, WA, USA
meltzoff@u.washington.edu

ABSTRACT
We present a novel stereo vision modeling framework that generates approximate, yet physically-plausible representations of objects rather than creating accurate models that are computationally expensive to generate. Our approach to modeling target scenes is based on carefully selecting a small subset of the total pixels available for visual processing. To achieve this, we use the estimation-exploration algorithm (EEA) to create the visual models: a population of three-dimensional models is optimized against a growing set of training pixels, and periodically a new pixel that causes disagreement among the models is selected from the observed stereo images of the scene and added to the training set. We show here that, using only 5% of the available pixels, the algorithm can generate approximate models of compound objects in a scene. Our algorithm serves the dual goals of extracting the 3D structure and relative motion of objects of interest by modeling the target objects in terms of their physical parameters (e.g., position, orientation, shape) and tracking how these parameters vary with time. We support our claims with results from simulation as well as from a real robot lifting a compound object.

Categories and Subject Descriptors
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—modeling and recovery of physical attributes

General Terms
Algorithms, design, experimentation

1. INTRODUCTION
Modeling of 3D objects from 2D images remains an unsolved computer vision problem.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GECCO'09, July 8–12, 2009, Montréal, Québec, Canada.
Copyright 2009 ACM 978-1-60558-325-9/09/07 ...$5.00.

We present here a novel modeling framework that enables a stereo vision system to rapidly create simulations of what it observes. This process is based on carefully selecting a small subset of pixels from the images of the left and right cameras, and using them to train a population of three-dimensional, physically-realistic models of the target scene. New pixels are requested from the camera images based on disagreements among the models. Each extracted pixel is added to the training set and modeling continues. We show here that, using only 5% of the pixels, the algorithm generates physically-plausible models of the target scene. The vision literature abounds with techniques capable of localizing and tracking moving objects in a scene. Our algorithm, however, is driven by the dual goals of extracting the 3D structure and relative motion of objects. This is achieved by modeling the target objects in terms of their physical parameters (e.g., position, orientation, shape) and tracking how these parameters vary with time. In particular, we infer that an object has undergone a change in one of its parameters if there is a statistically significant change in the parameter's value between the initial frame and the final frame of the corresponding video footage, and that two objects are components of the same compound object if there is no statistically significant change in the distance between them over successive frames.
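To make the estimation-exploration loop concrete, the following Python sketch applies it to a toy one-dimensional depth profile. The scene, the per-pixel model representation, the hill-climbing optimizer, and the variance-based disagreement measure are all simplified stand-ins chosen for illustration; the models in this paper are full three-dimensional physical simulations, and this sketch is not the authors' implementation.

```python
import random

random.seed(0)

# Toy "scene": the true depth at each pixel of a single image row.
TRUE_DEPTH = [1.0] * 10 + [4.0] * 10  # a step edge at pixel 10

def model_error(model, training_pixels):
    """Mean absolute error of a candidate model on the training pixels."""
    return sum(abs(model[i] - TRUE_DEPTH[i]) for i in training_pixels) / len(training_pixels)

def mutate(model):
    """Perturb one randomly chosen depth value."""
    child = list(model)
    i = random.randrange(len(child))
    child[i] += random.uniform(-0.5, 0.5)
    return child

def disagreement(models, pixel):
    """Variance of the models' predictions at a candidate pixel."""
    preds = [m[pixel] for m in models]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)

# Population of candidate depth models, and a single seed training pixel.
models = [[random.uniform(0.0, 5.0) for _ in TRUE_DEPTH] for _ in range(8)]
training_pixels = [0]

for generation in range(200):
    # Estimation phase: hill-climb each model against the training pixels.
    for k, m in enumerate(models):
        child = mutate(m)
        if model_error(child, training_pixels) <= model_error(m, training_pixels):
            models[k] = child
    # Exploration phase (every 20 generations): request the pixel the
    # current models disagree on most, and add it to the training set.
    if generation % 20 == 19:
        candidates = [i for i in range(len(TRUE_DEPTH)) if i not in training_pixels]
        best = max(candidates, key=lambda i: disagreement(models, i))
        training_pixels.append(best)

print("pixels used:", len(training_pixels), "of", len(TRUE_DEPTH))
```

The key design choice is that the teacher (the camera images) is queried only where the model population disagrees, so the training set grows to a small fraction of the available pixels rather than being fixed in advance.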
We support our claims with the results of structure and motion extraction experiments conducted against simulated scenes and video footage of a real robot lifting a compound object. Moreover, we show that this method results in the selection of pixels around the edges of observed objects, thereby leading to automated edge detection. The proposed method caters to specific visual perception requirements in social robotics. A primary goal in this context is to obtain models that can be used to simulate the physical repercussions of a teacher's actions, which is not possible using geometric representations. In order to equip social robots with rapid responses to their perceived environments, approximate yet physically-plausible representations of the observed entities take precedence over accurate models that are computationally expensive to generate.

The paper is organized as follows. Related prior vision approaches are outlined in Section 2. Scene modeling is described in Section 3.1. The estimation-exploration algorithm and its application to the vision modeling problem addressed here are presented in Section 3.2. Experimental results illustrating the basic working of the algorithm are reported in Section 4. The structure and motion extraction experiments comprising the main results of the paper are presented in Section 5. A discussion outlining some issues faced by the algorithm, and remedies that can be explored in future work, is provided in Section 6. Conclusions are drawn in Section 7.

2. PRIOR MODELING APPROACHES
Computer vision is a vast research area with innumerable techniques that have been developed to address problems such as object detection, object recognition, mo-