Conditional Sequence Model for Context-based Recognition of Gaze Aversion

Louis-Philippe Morency and Trevor Darrell

MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA 02139
{lmorency,trevor}@csail.mit.edu

Abstract. Eye gaze and gesture form key conversational grounding cues that are used extensively in face-to-face interaction among people. To accurately recognize visual feedback during interaction, people often use contextual knowledge from previous and current events to anticipate when feedback is most likely to occur. In this paper, we investigate how dialog context from an embodied conversational agent (ECA) can improve visual recognition of eye gestures. We propose a new framework for contextual recognition based on Latent-Dynamic Conditional Random Field (LDCRF) models to learn the sub-structure and external dynamics of contextual cues. Our experiments show that adding contextual information improves visual recognition of eye gestures and demonstrate that the LDCRF model for context-based recognition of gaze aversion gestures outperforms Support Vector Machines, Hidden Markov Models, and Conditional Random Fields.

Key words: Contextual information, Conditional Random Fields, Eye gesture recognition, gaze aversion

1 Introduction

In face-to-face interaction, eye gaze is known to be an important aspect of discourse and turn-taking. To create effective conversational human-computer interfaces, it is desirable to have computers which can sense a user’s gaze and infer appropriate conversational cues. Embodied conversational agents, either in robotic form or implemented as virtual avatars, have the ability to demonstrate conversational gestures through eye gaze and body gesture, and should also be able to perceive similar displays as expressed by a human user.

Previous work has shown that human participants avert their gaze (i.e. perform “look-away” or “thinking” gestures) to hold the conversational floor even while answering relatively simple questions [1]. A gaze aversion gesture while a person is thinking may indicate that the person is not finished with their conversational turn. If the ECA senses the aversion gesture, it can correctly wait for mutual gaze to be re-established before taking its turn.

When recognizing visual feedback, people use more than their visual perception. Knowledge about the current topic and expectations from previous utterances help guide our visual perception in recognizing nonverbal cues. Context