RIPLEY, HAND ME THE CUP! (SENSORIMOTOR REPRESENTATIONS FOR GROUNDING WORD MEANING)

Deb Roy, Kai-Yuh Hsiao, Nikolaos Mavridis, and Peter Gorniak
Cognitive Machines Group
MIT Media Laboratory
www.media.mit.edu/cogmac

ABSTRACT

People leverage situational context when using language. Rather than convey all information through words, listeners can infer speakers' meanings due to shared common ground [1, 2]. For machines to engage fully in conversation with humans, they must also link words to the world. We present a sensorimotor representation for physically grounding action verbs, modifiers, and spatial relations. We demonstrate an implementation of this framework in an interactive robot that uses the grounded lexicon to translate spoken commands into situationally appropriate actions.

1. SITUATED SPOKEN LANGUAGE

Speakers use spoken language to convey meaning to listeners by leveraging situational context. Context includes many levels of knowledge, ranging from fine-grained details of shared physical environments to shared cultural norms. As the degree of shared context between communication partners decreases, the efficiency of language also decreases, since the speaker is forced to explicate increasing quantities of information that could otherwise be left unsaid. A sufficient lack of common ground can lead to communication failures.

If machines are to engage in meaningful, fluent, situated spoken dialog, they must be aware of their situational context. As a starting point, we focus our attention on physical context. A machine that is aware of where it is, what it is doing, the presence and activities of other objects and people in its vicinity, and salient aspects of recent history can use these contextual factors to understand spoken language in a context-dependent manner.

A concrete example helps illustrate how a machine can make use of situational context. Consider a speech interface to the lights in a room.¹
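As a preview, such a context-aware light interface might be sketched as follows. The class name, the sensor interface, and the proportional dimming rule are illustrative assumptions, not part of the system described in this paper; the two behaviors it implements (toggling on "Lights!" and ambient-dependent dimming on "softer") are explained in the paragraphs below.

```python
# A minimal sketch of a context-aware light controller.
# All names and the dimming rule are illustrative assumptions.

class LightController:
    def __init__(self, ambient_sensor):
        self.on = False
        self.intensity = 0.0                  # 0.0 (off) .. 1.0 (full)
        self.ambient_sensor = ambient_sensor  # callable returning ambient light, 0.0 .. 1.0

    def handle(self, utterance):
        if utterance == "Lights!":
            # One bit of situational context: the light's current state.
            self.on = not self.on
            self.intensity = 1.0 if self.on else 0.0
        elif utterance == "softer":
            # Context-dependent interpretation: dim by an amount that is a
            # function of total ambient light, not by a fixed interval.
            ambient = self.ambient_sensor()
            step = 0.25 * ambient             # assumed proportional rule
            self.intensity = max(0.0, self.intensity - step)
            self.on = self.intensity > 0.0
```

Under this sketch, "softer" in a bright room (high ambient reading) produces a larger decrease than the same command in a dim room, mirroring the behavior a human listener would find natural.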
If a person simply says, "Lights!", the appropriate action will depend on the current state of the light. If it is already on, the command means turn it off; if it is already off, it means the opposite. In this simple example, the language understander needs access to a single bit of situational context: the current state of the light.

Consider a slightly richer problem, still in the domain of the light controller. How should the spoken command "softer" be interpreted by the light? Perhaps the simplest solution would be to decrease the intensity of the light by a fixed amount. Although this solution might be functional, it is not necessarily the most natural. In contrast to a fixed-interval solution, a person responding to this request would be likely to decrease the intensity by an amount that is a function of the intensity of light in the room at the time of the request. In general, many sources of light (e.g., a setting sun) may contribute to the total ambient light in the room. For a machine to leverage this situational information, we could add a light sensor to the controller that monitors ambient lighting conditions. A context-dependent interpretation of "softer" could then be defined.

¹ Ignoring, for the moment, the difficult issues of microphone placement and background noise that would also need attention.

1.1. Language Grounding

A necessary step towards creating situated speech processing systems is to develop representations and procedures that enable machines to ground the meaning of words in their physical environments. In contrast to dictionary definitions that represent words in terms of other words (leading, inevitably, to circular definitions for all words), grounded definitions anchor word meanings in non-linguistic primitives.
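A grounded lexicon of this kind can be sketched minimally: each word maps not to other words but to a predicate over non-linguistic percepts. The feature names, thresholds, and word set below are illustrative assumptions, not the representation developed in this paper.

```python
# Sketch of a grounded lexicon: word meanings are anchored in
# predicates over perceptual features rather than in other words.
# Percept fields (hue, area, x) and thresholds are assumed for illustration.

def make_grounded_lexicon():
    return {
        # Property terms ground in sensory measurements.
        "red":     lambda percept: percept["hue"] < 30 or percept["hue"] > 330,
        "large":   lambda percept: percept["area"] > 0.5,
        # Spatial terms ground in geometric relations between percepts.
        "left-of": lambda a, b: a["x"] < b["x"],
    }

def denotes(word, lexicon, *percepts):
    """Check whether `word` applies to the given percept(s)."""
    return lexicon[word](*percepts)
```

For example, given a percept `{"hue": 10, "area": 0.6, "x": 0.2}`, the predicate for "red" succeeds directly on sensor values, with no appeal to a verbal definition; circularity is broken because evaluation bottoms out in measurements of the environment.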
Assuming that a machine has access to its environment through appropriate sensory channels, language grounding enables machines to link linguistic meanings to elements of the machine's environment.

From environmentally aware light controllers to car navigation systems that see the same visual landmarks as the driver, the idea of context-grounded speech processing is the tip of a very large iceberg. We believe that a large class of spoken language understanding applications may benefit from language grounding. We will refer to this class of systems as having grounded semantics, in light of the explicit links of semantic representations to the machine's physical