Learning and Recognition of Objects Inspired by Early Cognition

Maja Rudinac, Gert Kootstra, Danica Kragic and Pieter P. Jonker

Abstract—In this paper, we present a unifying approach for learning and recognition of objects in unstructured environments through exploration. Taking inspiration from how young infants learn objects, we establish four principles for object learning. First, early object detection is based on an attention mechanism detecting salient parts in the scene. Second, motion of the object allows more accurate object localization. Third, acquiring multiple observations of the object through manipulation allows a more robust representation of the object. Finally, object recognition benefits from a multi-modal representation. Using these principles, we developed a unifying method including visual attention, smooth pursuit of the object, and a multi-view and multi-modal object representation. Our results indicate the effectiveness of this approach and the improvement of the system when multiple observations are acquired from active object manipulation.

I. INTRODUCTION

Bringing artificial systems to real-world environments poses many different problems that must be solved. One of the challenges is to recognize objects despite the uncontrolled nature of the real world. The system needs to overcome variations in object appearance due to viewpoint or environmental conditions. In this paper, we approach this challenge by taking inspiration from object learning in infants. As defined in the cognitive theory of Piaget [18], infants learn representations of objects by actively exploring them. Doing so allows the infant to observe the objects from different viewpoints, and thus to explore the possible variations in appearance. In the early stages of child development, the infant's visual attention is directed primarily to salient parts of the environment [22].
The child will first be able to learn representations of the objects that are actively shown by the caregivers [9]. In later stages, the infant will learn to manipulate and explore the objects independently [20]. In this paper, we aim to mimic the early stage of object learning on an artificial cognitive system, using a caregiver to demonstrate objects by manipulating them. We believe it is important to start at an early stage in order to develop and test important concepts in object learning. Future versions of our system will develop in line with child development, as advocated in [28].

[Figure 1. Diagram labels: Working memory, Central executive, Visual memory (Color, Texture, Shape), Long-term memory, Declarative memory, Attention, Visual attention (Color, Intensity, Motion), Eye movements (Smooth pursuit, Saccades), World.]
Fig. 1: Cognitive model for object learning and recognition based on Baddeley's model of working memory [1] and Knudsen's model of attention [10].

Figure 1 shows our cognitive model of object learning and recognition. The model is based on Baddeley's model of working memory [1] and Knudsen's model of attention [10]. We narrow both models down to the parts that deal with visual information. The central executive is responsible for the control of cognitive processes and, in our approach, has the coordinating role in visual learning and recognition, involving the long-term memory, which stores the object representations. Additionally, it is involved in the control of attention. The visual memory (termed visuo-spatial sketchpad in [1]) holds the visual information of the attended regions, such as color, texture, and shape information. Our model furthermore includes an attention mechanism as an interface between the working memory and the outside world, as proposed in Knudsen's model of attention [10]. Attention is focused on relevant parts of the visual field based on different types of visual information, such as color, intensity, and motion.
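
As a rough illustration of how attention could be driven by color, intensity, and motion cues, the following minimal Python sketch combines three feature maps into one saliency map and takes the most salient pixel as the focus of attention. It is not the implementation used in this paper. The function names (`normalize`, `feature_maps`, `saliency`, `focus_of_attention`), the equal weighting of the cues, the simple opponency and global-contrast features, and the frame-differencing motion cue are all assumptions chosen for brevity.

```python
# Illustrative sketch only: a toy bottom-up saliency map on tiny RGB frames.
# All names, features, and weights are assumptions, not the paper's method.

def normalize(m):
    """Rescale a 2D map (list of rows) to [0, 1]; a flat map becomes all zeros."""
    lo = min(min(row) for row in m)
    hi = max(max(row) for row in m)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def feature_maps(frame, prev_frame):
    """Return (color, intensity, motion) maps for an RGB frame.

    frame and prev_frame are lists of rows of (r, g, b) tuples in [0, 255].
    """
    gray = [[sum(px) / 3.0 for px in row] for row in frame]
    prev_gray = [[sum(px) / 3.0 for px in row] for row in prev_frame]
    n_pixels = len(frame) * len(frame[0])
    mean_gray = sum(sum(row) for row in gray) / n_pixels

    # Color cue: magnitude of red-green plus blue-yellow opponency.
    color = [[abs(r - g) + abs(b - (r + g) / 2.0) for (r, g, b) in row]
             for row in frame]
    # Intensity cue: deviation of each pixel from the global mean brightness.
    intensity = [[abs(v - mean_gray) for v in row] for row in gray]
    # Motion cue: absolute brightness difference between consecutive frames.
    motion = [[abs(a - b) for a, b in zip(row, prev_row)]
              for row, prev_row in zip(gray, prev_gray)]
    return color, intensity, motion


def saliency(frame, prev_frame, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the normalized color, intensity, and motion maps."""
    maps = [normalize(m) for m in feature_maps(frame, prev_frame)]
    total = sum(weights)
    height, width = len(frame), len(frame[0])
    return [[sum(w * m[y][x] for w, m in zip(weights, maps)) / total
             for x in range(width)] for y in range(height)]


def focus_of_attention(sal):
    """Return (row, col) of the first pixel with maximal saliency."""
    coords = [(y, x) for y in range(len(sal)) for x in range(len(sal[0]))]
    return max(coords, key=lambda p: sal[p[0]][p[1]])


if __name__ == "__main__":
    gray_px = (100, 100, 100)
    prev = [[gray_px] * 4 for _ in range(4)]
    curr = [row[:] for row in prev]
    curr[1][2] = (255, 0, 0)  # a red object appears at row 1, column 2
    print(focus_of_attention(saliency(curr, prev)))
```

In this toy setting, a single red pixel appearing on a uniform gray background stands out in all three cues at once, so it becomes the focus of attention. A practical system would work on multi-scale, center-surround features rather than the global statistics used here.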
The focus of attention can be shifted to different parts of the visual field through saccadic eye movements, or held on a moving object through smooth pursuit.

The first problem that arises in learning is how to localize and segment unknown objects from the background. For the detection of unknown objects in a scene, no top-down knowledge can be used. Object-detection methods based on 3D point clouds calculated from stereo-image pairs [2] provide good results for textured objects. However, they fail for uniformly colored objects, which are widely present in the environment. As a solution to this challenging problem, we therefore consider bottom-up