From Proc. Second International Conference on Automatic Face and Gesture Recognition, IEEE Computer Society Press, Killington, VT, Oct. 1996.

Visual Interaction With Lifelike Characters

Matthew Turk
Vision Technology Group
Microsoft Research
One Microsoft Way
Redmond, WA 98052-6399
mturk@microsoft.com

Abstract

This paper explores the use of fast, simple computer vision techniques to add compelling visual capabilities to social user interfaces. Social interfaces involve the user in natural dialog with animated, “lifelike” characters. However, current systems employ spoken language as the only input modality. Used effectively, vision can greatly enhance the user’s experience of interacting with these characters. In addition, vision can provide key information to help manage the dialog and to aid the speech recognition process. We describe constraints imposed by the conversational environment and present a set of “interactive-time” vision routines that begin to support the user’s expectations of a seeing character. A control structure is presented which chooses among the vision routines based on the current state of the character, the conversation, and the visual environment. These capabilities are beginning to be integrated into the Persona lifelike character project.

1. Introduction

Human interactions with machines are inherently and unavoidably social. We respond to computers as if they were human, and the social and emotional aspects of that interaction are an important area of user interface research [1]. Social interfaces involve computer-generated characters which attempt to interact with people in natural ways. Current examples of these “lifelike characters” can understand human speech in limited domains and exhibit behavior that appears personable and intelligent. In many situations, such interfaces are more compelling than traditional techniques such as dialog boxes, command lines, and stored presentations.
However, virtually all lifelike characters are currently blind, with no visual knowledge of the human participants or their environment. We are attempting to impart visual abilities to social interfaces so that the characters know whether someone is there, how many people are there, where the participants are looking, what they are doing, etc. The integration of these capabilities will enable a much richer, more compelling experience for people interacting with lifelike characters, and with technology in general.

For the environment to be believable and compelling for the user, human interaction with computer-based characters must be similar to normal human-human interaction. Dialog is by nature interactive, requiring the responses of the participants to be both meaningful and timely. As with other perceptual components (e.g., speech recognition and natural language understanding), vision must be reliable and fast relative to the tasks at hand. These constraints characterize “interactive-time” vision routines [2], which have the following properties:

• Fast. Speed requirements are defined by context; some visual events must be handled more rapidly than others. For example, interpreting user motion to control a pointing device must be done at a higher rate than interpreting a gesture that signals “goodbye”.

• Low latency. The total response time is more important than the processing rate (frames per second). Latency and speed requirements are constrained by the maximum acceptable delay in response to various visual events, which may vary among scenarios (e.g., a “power user” vs. an entertainment application).

• Task specific. Routines should take advantage of known constraints that simplify the processing, such as a non-moving camera, a static background scene, or consistent lighting conditions.

Enumerating
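The latency and task-specificity properties above can be made concrete with a small sketch. The following code is illustrative only and is not from the paper: the `VisionRoutine` and `run_routine` names, the particular latency budgets, and the stub routines are hypothetical, chosen to show how each task-specific routine can carry its own maximum acceptable response delay, distinct from raw frame rate.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class VisionRoutine:
    """A task-specific vision routine with a context-dependent latency budget."""
    name: str
    max_latency_s: float                 # maximum acceptable response delay
    process: Callable[[object], object]  # stub for the actual image processing

def run_routine(routine: VisionRoutine, frame: object):
    """Run one routine on a frame and check the result against its budget."""
    start = time.monotonic()
    result = routine.process(frame)
    latency = time.monotonic() - start
    within_budget = latency <= routine.max_latency_s
    return result, latency, within_budget

# Hypothetical routines: pointer control needs a much tighter budget than
# recognizing a "goodbye" wave, as the text notes.
pointer_tracking = VisionRoutine("pointer", 0.05, lambda f: "pointer position")
wave_detection = VisionRoutine("goodbye wave", 0.5, lambda f: "wave detected")

for routine in (pointer_tracking, wave_detection):
    result, latency, ok = run_routine(routine, frame=None)
    print(f"{routine.name}: {result} "
          f"(latency {latency * 1000:.1f} ms, "
          f"{'within' if ok else 'over'} budget)")
```

A control structure of the kind the paper describes would select among such routines based on the character's state and the visual environment, dropping or simplifying routines whose budgets cannot be met.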