758 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 8, NO. 3, MAY 1997

SOIM: A Self-Organizing Invertible Map with Applications in Active Vision

Narayan Srinivasa and Rajeev Sharma

Abstract—We propose a novel neural network called the self-organizing invertible map (SOIM) that is capable of learning many-to-one functional mappings in a self-organized and on-line fashion. The design and performance of the SOIM are highlighted by learning a many-to-one functional mapping that exists in active vision for the spatial representation of three-dimensional point targets. The learned spatial representation is invariant to changing camera configurations. The SOIM also possesses an invertible property that can be exploited for active vision. An efficient and experimentally feasible method was devised for learning this representation on a real active vision system. The proof of convergence during learning, as well as the conditions for invariance of the learned spatial representation, are derived and then experimentally verified using the active vision system. We also demonstrate various active vision applications that benefit from the properties of the mapping learned by the SOIM.

Index Terms—Active vision, invertible map, motion detection, neural networks, robot control, saccade sequencing, spatial representation.

I. INTRODUCTION

RESEARCHERS have been studying the use of artificial neural networks (ANN's) as a nonlinear function approximation tool. In fact, ANN's have been successfully used for a large variety of function approximation applications, including pattern recognition and computer vision [8], [29], [30], adaptive signal processing [36], [20], and the control of highly nonlinear dynamical systems [6], [21]. In pattern recognition applications, ANN's are used to construct pattern classifiers that are capable of separating patterns into distinct classes.
In signal processing and control applications, ANN's are used to build a model of some physical system from data in the form of examples that emulate the behavior of the system. Here the ANN is essentially used as a tool to extract the functional mapping between the inputs and outputs of the system without making assumptions about its functional form.

A special instance of these function approximation problems involves learning a many-to-one functional mapping. For a system that exhibits this kind of mapping, many distinct inputs map to a single output. Such a mapping is particularly relevant in the context of extracting invariant properties of a function, wherein the single output value provides an invariant characteristic of the function for the given set of input values.

Manuscript received March 19, 1996; revised August 12, 1996.
N. Srinivasa is with the Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801 USA.
R. Sharma is with the Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802 USA.
Publisher Item Identifier S 1045-9227(97)02766-5.

Fig. 1. An illustration of a many-to-one mapping in active vision.

An example of a many-to-one mapping can be found in the task of spatially representing three-dimensional (3-D) point targets for an active vision system. Consider two stationary 3-D point targets as depicted in Fig. 1. In this figure, an active stereo camera system is fixated on one of the point targets (i.e., its image is registered in the center of both image planes). For this camera configuration, the representation of the other point target (if also visible to the camera) must identify it as being farther away from the active vision system than the fixated target. If the camera configuration changes to fixate on other points in space (refer to Fig.
1) such that both targets are still visible, the representation of the two point targets must not change (from that obtained at the original fixation) despite the change in their image locations on the cameras. In other words, there exist many combinations of camera signals that correspond to each 3-D target. This means that the representation of the 3-D targets must be invariant to changing camera configurations.

This process can be more formally described as follows. If we define v as the vision vector, given by the relation

v = g(p, c)    (1)

where g is the function that captures the image projection for the stereo system, p is the 3-D point, and c is the camera configuration, then we can define the spatial representation s of a 3-D point (that is to be learned) as

s = Phi(v)    (2)

or

s = Phi(g(p, c)).    (3)
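To make the invariance property concrete, the following is a minimal numeric sketch, not the paper's actual system: a toy two-camera rig in the plane, where a projection function (called g here) produces a different vision vector for every gaze configuration, yet a triangulating representation (called phi here) recovers the same 3-D coordinates from all of them. All names, the baseline value, and the geometry are hypothetical illustrations of the many-to-one idea.

```python
import math

B = 0.2  # hypothetical stereo baseline (m); cameras sit at x = -B/2 and x = +B/2

def g(target, gaze):
    """Toy image-projection function: returns the vision vector, i.e.,
    the target's retinal angle relative to each camera's gaze direction,
    together with the gaze (pan) angles themselves."""
    X, Z = target
    gl, gr = gaze                          # left/right gaze angles
    al = math.atan2(X + B / 2, Z) - gl     # left retinal angle
    ar = math.atan2(X - B / 2, Z) - gr     # right retinal angle
    return (al, ar, gl, gr)

def phi(v):
    """Toy spatial representation: triangulate (X, Z) from the vision
    vector. Adding the gaze angles back in makes the result invariant
    to the camera configuration."""
    al, ar, gl, gr = v
    tl = math.tan(al + gl)                 # absolute ray slope, left camera
    tr = math.tan(ar + gr)                 # absolute ray slope, right camera
    Z = B / (tl - tr)
    X = tl * Z - B / 2
    return (round(X, 6), round(Z, 6))

target = (0.1, 1.0)
s1 = phi(g(target, (0.0, 0.0)))    # one camera configuration
s2 = phi(g(target, (0.3, -0.2)))   # a different configuration, different vision vector
assert s1 == s2 == (0.1, 1.0)      # many vision vectors, one spatial representation
```

Note that g is many-to-one only in the sense that matters here: every gaze configuration yields a distinct vision vector, but phi collapses all of them to a single representation of the target, which is exactly the invariance the SOIM is asked to learn.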