DRAFT: DO NOT QUOTE OR CITE

Coupling Robot Perception and Online Simulation for Grounding Conversational Semantics

Deb Roy, Kai-Yuh Hsiao, and Nikolaos Mavridis

Abstract— How can we build robots that engage in fluid spoken conversations with people, moving beyond canned responses and towards actual understanding? Many difficult questions arise regarding the nature of word meanings, and how those meanings are grounded in the world of the robot. We introduce an architecture that provides the basis for grounding word meanings in terms of robot perception, action, and memory. The robot’s perceptual system drives an online simulator that maintains a virtual version of the physical environment in synchronization with the robot’s noisy and changing perceptual input. The simulator serves as a “mental model” that enables object permanence and virtual shifts of perspective. This architecture provides a rich set of data structures and procedures that serve as a basis set for grounding lexical semantics, a step towards situated, conversational robots.

Index Terms— Robots, Natural language interfaces, Knowledge representation, Active vision, Simulation.

I. INTRODUCTION

LANGUAGE enables people to talk about the world. Through language, we are able to refer to the past and future, and to describe things as they are or how we imagine them. For a robot to use language in human-like ways, it must ground the meaning of words in its world as mediated by perception, action, and memory. Many words that refer to things in the world can be grounded through sensory-motor associations. For instance, the meaning of ball includes perceptual associations that encode how balls look and predictive models of how balls behave. The representation of touch includes procedural associations that encode how to perform the action, as well as perceptual encodings to recognize the action in others.
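The pairing of perceptual and procedural associations just described can be sketched as a simple data structure. The class, field, and function names below are illustrative assumptions, not details of the authors' implementation; perceptual associations are modeled as classifiers over feature dictionaries and procedural associations as callable motor routines.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Grounding:
    """Hypothetical record tying a word to sensory-motor associations."""
    word: str
    # perceptual association: recognizes instances of the concept in input features
    perceptual: Optional[Callable[[dict], bool]] = None
    # procedural association: a motor program for action words like "touch"
    procedural: Optional[Callable[[dict], None]] = None

def looks_like_ball(features: dict) -> bool:
    # toy perceptual model: balls are round and small enough to grasp
    return features.get("roundness", 0.0) > 0.8 and features.get("size", 1.0) < 0.2

lexicon = {
    "ball": Grounding("ball", perceptual=looks_like_ball),
    "touch": Grounding("touch", procedural=lambda target: None),  # stub motor program
}

# A word's concept is communicated to the degree that speaker and listener
# maintain similar associations; here, recognizing a percept as a "ball":
percept = {"roundness": 0.9, "size": 0.1}
is_ball = lexicon["ball"].perceptual(percept)
```

In this sketch, a word such as ball carries only a perceptual association, while an action word such as touch carries a procedural one; words like heavy, discussed below, would need both.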
Words thus serve as labels for perceptual or action concepts that are anchored in sensory-motor representations. When a word is uttered, the underlying concept is communicated to the degree that the speaker and listener maintain similar associations. This basic approach underlies most work to date in building machines that ground language [1]–[8].

Not all words, however, can be grounded in sensory-motor representations. In even the simplest conversations about everyday objects, events, and relations, problems arise. Consider a person and a robot sitting across a table from each other, engaged in coordinated activity involving manipulation of objects. After some interaction, the person says to the robot:

Touch the heavy blue thing that was on my left.

To understand and act on this command, the robot must bind words of this utterance to a range of representations:

touch: can be grounded in a visually-guided motor program that enables the robot to move towards and touch objects. This is an example of a procedural association that depends on perception to guide action.

heavy: specifies a property of objects which involves affordances [9]. A light object affords manipulation whereas a sufficiently heavy one does not. To represent affordances, both procedural and perceptual representations must be combined.

blue: specifies a visual property, an example of perceptual grounding.

thing: must be grounded in terms of both perception and affordances (one can see an object, and expect to reach out and touch it).

was: triggers a shift of perspective in time. Words and inflections that mark tense must cue the robot to “look” back in time to successfully ground the referent of the utterance.

my: triggers a shift of perspective in space. As opposed to your left or simply left (which would be ambiguous), my tells the listener to look at the world from the speaker’s point of view.
left: can be grounded in visual features that compute linguistically-salient spatial relations between objects within an appropriate frame of reference.

We have developed an architecture in which a robotic manipulator is coupled with a physical simulator. By virtue of the robot’s sensory, motor control, and simulation processes, a set of representations is obtained for grounding each of the kinds of words listed above.¹ The robot, called Ripley (Figure 1), is driven by compliant actuators which enable it to manipulate small objects. Ripley has cameras, touch sensors, and various other sensors on its “head”. Force and position sensors in each actuated joint provide a sense of proprioception. Ripley’s visual and proprioceptive systems drive a physical simulator that keeps a constructed version of the world (including a representation of itself) in synchronization with Ripley’s noisy perceptual input. An object permanence module determines when to instantiate and destroy objects in the simulation.

¹We acknowledge that the words in this example, like most words, have numerous additional connotations that are not captured by the representations that we have suggested. For example, words such as touch, heavy, and blue can be used metaphorically to refer to emotional actions and states. Things are not always physical, perceivable objects, my usually indicates possession, and so forth. Barwise and Perry use the phrase “efficiency of language” to highlight the situation-dependent reusability of words and utterances [10]. Given the utterance and context that we described, the groundings listed above are sufficient. Other senses of words may be metaphoric extensions of these embodied representations [11].
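The coupling between noisy perception and a persistent simulated world can be sketched as a per-frame update loop. The matching threshold, sighting count, smoothing gain, and all names below are assumptions for illustration, not parameters of Ripley's implementation; the sketch shows only the core object-permanence idea, namely that simulated objects persist across perceptual dropouts rather than tracking raw detections directly.

```python
import math

MATCH_RADIUS = 0.05    # assumed radius (m) for matching a detection to a known object
CONFIRM_SIGHTINGS = 3  # assumed sightings needed before an object counts as "real"

class SimObject:
    """An object instantiated in the simulator's constructed world model."""
    def __init__(self, pos):
        self.pos = pos
        self.sightings = 1

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def update_simulator(sim_objects, detections, gain=0.5):
    """Synchronize the simulated world with one frame of noisy 2-D detections."""
    for det in detections:
        match = next((o for o in sim_objects if dist(o.pos, det) < MATCH_RADIUS), None)
        if match:
            # smooth the noisy observed position into the stable simulated one
            match.pos = tuple(p + gain * (d - p) for p, d in zip(match.pos, det))
            match.sightings += 1
        else:
            sim_objects.append(SimObject(det))
    # note: unmatched simulated objects are deliberately kept (object permanence)
    return sim_objects

def confirmed(sim_objects):
    """Objects seen often enough to be treated as real rather than sensor noise."""
    return [o for o in sim_objects if o.sightings >= CONFIRM_SIGHTINGS]

# Three frames of jittery detections of one physical object:
world = []
for frame in range(3):
    world = update_simulator(world, [(0.10 + 0.01 * frame, 0.20)])
```

After these frames the simulator holds a single confirmed object whose position is a smoothed estimate of the jittery input, and it would continue to hold that object even through frames with no detections at all.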