Interactive Learning and Cross-Modal Binding - A Combined Approach *

Henrik Jacobsson 1, Nick Hawes 2, Danijel Skočaj 3, Geert-Jan M. Kruijff 1
1 Language Technology Lab, DFKI GmbH, Germany
2 School of Computer Science, University of Birmingham, UK
3 University of Ljubljana, Slovenia

Introduction

To function properly in the world, a cognitive system should possess the ability to learn and adapt in a continuous, open-ended, life-long fashion. This learning is inherently cross-modal: the system should use all of its percepts and capabilities to sense and understand the environment, and update its current knowledge accordingly. For life-long learning to be effective, the system must also be able to incorporate knowledge from other knowledgeable cognitive systems through interactive learning. For this to be "socially acceptable", it is important to support a wide variety of tutoring channels. For example, treating the tutor only as a source of linguistic labels is not a natural way of communicating and is thus not very effective from the human's point of view. For an excellent and deep account of design considerations for socially interactive learning systems, see (Thomaz, 2006).

A prerequisite for interactive learning is the successful interpretation of the references used in dialogue with a human. The robot must therefore be able to form associations between information in different modalities, e.g. between linguistic references and visual input (Roy, 2005). Forming these associations is a process we refer to as cross-modal binding. We are developing a multifaceted approach to binding, and in this extended abstract we address the benefits of combining binding and interactive learning.
Cross-Modal Binding

We treat the binding of linguistic and visual content as an instance of a broader cross-modal binding problem: enabling a broad and open-ended set of modalities to contribute towards a common representation of abstract concepts, objects, and actions, and of the N-ary relations between them. For example, for a robot to determine the correct response to "give me the blue mug that's to the right of the plate", it must correctly interpret the references to the objects, the action, and the spatial relationship.

Typical robotic systems are composed of specialised subsystems, e.g. vision, manipulation, dialogue, reasoning, etc. For N subsystems there are N(N-1)/2 potential pairwise interfaces between them. Building associations in this manner can quickly become expensive to manage at both design- and run-time. To avoid this, we employ a two-level approach to binding. The bottom level corresponds to subsystem-specific representations. The second level represents objects, actions, and relations by bundling together sets of features abstracted from the first-level representations. These "bundles" represent a subsystem's best hypotheses about the objects, actions, and relations in its modality. To build a common representation from all its subsystems, a number of binding processes then operate on this more abstract level of information. This is illustrated in Figure 1; further details are given in previous work (Hawes et al., 2007). The focus of this abstract is that the information used to associate features across modalities may be learned, and that this two-level system naturally supports such learning.

* This work was supported by the EU FP6 IST Cognitive Systems Integrated Project "CoSy" FP6-004250-IP.

Cross-Modal Learning

When the binding processes establish associations between bundles of abstracted features, these associations implicitly link features from the respective modalities.
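The two-level scheme can be sketched in code. This is only an illustrative approximation, not the CoSy implementation: each subsystem abstracts its internal representation into a "bundle" of named features, and a central binder groups bundles from different modalities whose shared feature dimensions do not conflict. All class and function names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Bundle:
    """A subsystem's abstract-level hypothesis: a set of named features."""
    modality: str
    features: dict  # e.g. {"type": "mug", "colour": "blue"}

def compatible(a: Bundle, b: Bundle) -> bool:
    """Two bundles may bind if no feature dimension they share conflicts."""
    shared = set(a.features) & set(b.features)
    return all(a.features[k] == b.features[k] for k in shared)

def bind(bundles):
    """Greedily group compatible bundles from distinct modalities."""
    unions = []
    for b in bundles:
        for u in unions:
            if all(x.modality != b.modality and compatible(x, b) for x in u):
                u.append(b)
                break
        else:
            unions.append([b])
    return unions

vision = Bundle("vision", {"type": "mug", "colour": "blue"})
speech = Bundle("language", {"type": "mug"})
plate  = Bundle("vision", {"type": "plate"})
unions = bind([vision, speech, plate])
# The linguistic "mug" reference binds to the visual mug, not the plate.
```

Note the hub-like shape of this design: each subsystem only talks to the binder through its bundles, avoiding the N(N-1)/2 pairwise interfaces a direct subsystem-to-subsystem wiring would require.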
Some of these links will represent known cross-modal mappings between features, but others may represent valid mappings that the system does not yet know about. For example, in the utterance "give me the blue mug that's to the right of the plate", visual colour features (blue pixels) may be implicitly linked to linguistic colour features ("blue") via an association formed from a type description ("the mug"). When the binding of the object descriptions succeeds, the binder can generate novel training examples for a learning module. In the case above, the binder would generate training examples for updating the representations of "blue", "the mug", "to the right of", and "the plate". In this way, the system can increase its current knowledge without being explicitly instructed, and without training examples being provided separately. An idealised learner would try to use all the inferred information and data from all modalities to co-train (cf. Levin et al., 2003) its representations in the other modalities as well.

Any learning method that uses binding processes for training will thus be fed by a stream of examples of cross-modal associations. The open-ended nature of this input makes it important that any learning systems used are incremental; the learning process should continue to im-
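The step from a successful binding to new training data can be sketched as follows. This is a minimal illustration under assumed representations (the paper does not specify these): once a visual and a linguistic bundle bind via a shared feature, the remaining features in matching dimensions are paired into labelled examples, and an incremental learner updates a running-mean prototype per label, one example at a time. All names are hypothetical.

```python
def training_examples(visual_features, linguistic_features):
    """Pair each visual feature vector with the linguistic label
    occupying the same feature dimension in the bound bundle."""
    return [(label, visual_features[dim])
            for dim, label in linguistic_features.items()
            if dim in visual_features]

class IncrementalLearner:
    """Keeps a running-mean prototype per label; no batch retraining,
    so it can consume an open-ended stream of binding-derived examples."""
    def __init__(self):
        self.counts, self.prototypes = {}, {}

    def update(self, label, vector):
        n = self.counts.get(label, 0)
        proto = self.prototypes.get(label, [0.0] * len(vector))
        # Incrementally fold the new example into the stored mean.
        self.prototypes[label] = [(p * n + v) / (n + 1)
                                  for p, v in zip(proto, vector)]
        self.counts[label] = n + 1

visual = {"colour": (0.1, 0.2, 0.9)}          # e.g. mean RGB of the mug region
language = {"colour": "blue", "type": "mug"}  # from the parsed utterance
learner = IncrementalLearner()
for label, vec in training_examples(visual, language):
    learner.update(label, vec)
# "blue" is now associated with the blue-ish pixel statistics,
# without the tutor ever providing a separate labelled example.
```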