Affordance-based word-to-meaning association

V. Krunic, G. Salvi, A. Bernardino, L. Montesano, J. Santos-Victor

Abstract— This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow us to incorporate context in the speech recognition task.

I. INTRODUCTION

To interact with humans, a robot needs to communicate with people and understand their needs and intentions. By far the most natural way for a human to communicate is language. This paper deals with the acquisition by a robot of language capabilities linked to manipulation tasks. Our approach draws inspiration from infant cross-situational word learning theories that suggest that infant learning is an iterative bootstrapping process [12]. It occurs in an incremental way (from simple words to more complex structures) and involves multiple tasks such as word segmentation, speech production, and meaning discovery. Furthermore, it is highly coupled with other learning processes such as manipulation, for instance, in mother-infant interaction schemes [8]. Out of the multiple aspects of language acquisition, this paper focuses on the ability to discover the meaning of words through human-robot interaction.
We adopt a developmental robotics approach [18], [10] to tackle the language acquisition problem. In particular, we consider the developmental framework of [11], where the robot first explores its sensory-motor capabilities. Then, it interacts with objects and learns their affordances, i.e., relations between actions and effects. The affordance model uses a Bayesian network to capture the statistical dependencies among a set of basic robot manipulation actions (e.g., grasp or tap), object features and the observed effects, by means of statistical learning techniques exploiting the co-occurrence of stimuli in the sensory patterns.

V. Krunic, G. Salvi, A. Bernardino, L. Montesano and J. Santos-Victor are with the Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisboa, Portugal. {vkrunic,gsalvi,alex,lmontesano,jasv}@isr.ist.utl.pt
G. Salvi is currently with the Speech, Music and Hearing lab at the Royal Institute of Technology (KTH), Stockholm, Sweden.
This work was supported by EU NEST Project 5010 - Contact, and by Fundação para a Ciência e Tecnologia (ISR/IST plurianual funding) through the POS Conhecimento Program that includes FEDER funds.

The main contribution of the paper is the inclusion in the affordance model of [11] of verbal descriptions, provided by a human, of the robot's activities. The model exploits temporal co-occurrence to associate speech segments to meanings in terms of actions, object properties and the corresponding effects. Although we do not use any social cues or the number and order of words, the model provides the robot with the means to learn and refine the meaning of words in such a way that it develops a rough understanding of speech based on its own experience. Our model has been evaluated using a humanoid torso able to perform simple manipulation tasks and to recognize words from a basic dictionary.
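To make the idea of learning word meanings from temporal co-occurrence concrete, the following toy sketch counts how often each recognized word co-occurs with each value of the affordance variables (action, object feature, effect) and reads off conditional frequencies. This is only an illustrative simplification in Python: the paper's actual model is a Bayesian network, and all class names, variable names and episode data below are our own hypothetical examples, not taken from the paper.

```python
from collections import Counter, defaultdict

class WordAffordanceModel:
    """Toy co-occurrence model linking spoken words to discrete
    affordance variables (action, object feature, observed effect)."""

    def __init__(self):
        self.pair_counts = defaultdict(Counter)  # concept value -> word counts
        self.concept_counts = Counter()          # concept value -> #episodes

    def observe(self, words, concepts):
        """One experience: a bag of recognized words together with the
        discrete values taken by the affordance variables."""
        for value in concepts:
            self.concept_counts[value] += 1
            for w in set(words):  # count presence, not repetitions
                self.pair_counts[value][w] += 1

    def p_word_given(self, word, value):
        """Fraction of episodes with this affordance value in which
        the word was heard."""
        n = self.concept_counts[value]
        return self.pair_counts[value][word] / n if n else 0.0


model = WordAffordanceModel()
# Hypothetical interaction episodes: verbal description + affordance tuple
model.observe(["the", "robot", "grasps", "the", "ball"],
              ["action=grasp", "shape=round", "effect=lifted"])
model.observe(["he", "taps", "the", "box"],
              ["action=tap", "shape=square", "effect=moved"])
model.observe(["the", "robot", "grasps", "the", "box"],
              ["action=grasp", "shape=square", "effect=lifted"])

# "grasps" is heard in every grasp episode but never with a tap,
# which hints at its meaning without any grammatical analysis.
print(model.p_word_given("grasps", "action=grasp"))  # 1.0
print(model.p_word_given("grasps", "action=tap"))    # 0.0
```

Note how a word like "the" co-occurs with every affordance value and therefore carries no discriminative information, while "grasps" and "taps" concentrate on single action values; this is the intuition behind learning meanings from frequencies alone.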
We show that simply measuring the frequencies of words with respect to a self-constructed model of the world, the affordance network, is enough to provide information about the meaning of these utterances, even without considering prior semantic knowledge or grammatical analysis. By embedding the learning into the robot's own task representation, it is possible to derive links between words such as nouns, verbs and adjectives and the properties of the objects, actions and effects. We also show how the model can be directly used to instruct the robot and to provide contextual information to the speech recognition system.

The rest of the paper is organized as follows. After discussing related work, Section III briefly describes, through our particular robotic setup, the problem and the general approach to be taken in the learning and exploitation phases of the word-concept association problem. Section IV presents the language and manipulation task model and the algorithms used to learn and make inferences. In Section V we describe the experiments and provide some details on the speech recognition methods employed. Results are presented in Section VI and, finally, in Section VII we conclude our work and present ideas for future developments.

II. RELATED WORK

Computational models for cross-situational word learning have only been studied recently. One of the earliest works is that of Siskind [15], who proposes a mathematical model and algorithms for solving an approximation of the lexical-acquisition task faced by children. The paper includes computational experiments, using a rule-based logical inference system, that show that the acquisition of word-to-meaning mappings can be performed by constraining the possible meanings of words given their context of use. He shows that acquisition of word-to-meaning mappings might be possible without knowledge of syntax, word order or reference to properties of internal representations other than