A Survey of Affect Recognition Methods: Audio, Visual and Spontaneous Expressions

Zhihong Zeng 1, Maja Pantic 2, Glenn I. Roisman 1 and Thomas S. Huang 1
1 University of Illinois at Urbana-Champaign, USA
2 Imperial College London, UK / University of Twente, Netherlands
{zhzeng,huang}@ifp.uiuc.edu, m.pantic@imperial.ac.uk, roisman@uiuc.edu

ABSTRACT
Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. Promising approaches have been reported, including automatic methods for facial and vocal affect recognition. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions, despite the fact that deliberate behavior differs in its visual and audio characteristics from spontaneously occurring behavior. Recently, efforts to develop algorithms that can process naturally occurring human affective behavior have emerged. This paper surveys these efforts. We first discuss human emotion perception from a psychological perspective. Next, we examine the available approaches to solving the problem of machine understanding of human affective behavior occurring in real-world settings. We finally outline some scientific and engineering challenges for advancing human affect sensing technology.

Categories and Subject Descriptors
A.1 [Introduction and Survey]; H.1.2 [User/Machine Systems]: Human information processing; H.5.1 [Multimedia Information Systems]: Evaluation/methodology; I.5.4 [Pattern Recognition Applications]

General Terms
Algorithms, Performance.

Keywords
Multimodal human-computer interaction, multimodal user interfaces, affective computing, human computing, affect recognition, emotion recognition.

1. INTRODUCTION
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. Consequently, future “ubiquitous computing” environments will need to have human-centered designs instead of computer-centered designs [15], [20], [57], [63], [64]. A change in the user’s affective state is a fundamental component of human-human communication: some affective states motivate human actions, while others enrich the meaning of human communication. Traditional HCI, which ignores the user’s affective states, therefore filters out a large portion of the information available in the interaction process. The Human Computing paradigm suggests that user interfaces of the future need to be proactive and human-centered, based on naturally occurring multimodal human communication [57]. More specifically, human-centered interfaces must have the ability to detect subtleties of and changes in the user’s behavior, especially his or her affective behavior, and to initiate interactions based on this information, rather than simply responding to the user’s commands.

Fig. 1 illustrates a prototype of such an affect-sensitive, multimodal computer-aided learning system. The system was built during the NSF ITR project titled “Multimodal Human Computer Interaction: Toward a Proactive Computer” (http://itr.beckman.uiuc.edu). In this learning environment, the user explores Lego gear games by interacting with a computer avatar. Multiple sensors are used to detect and track the user’s behavioral cues and his or her task. More specifically, the information recognized from these sensors includes the user’s emotional state, engagement state, spoken keywords, and the state of the gear task. Based on this information, the avatar offers an appropriate tutoring strategy in this interactive learning environment.
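To make the sensing-to-adaptation flow of such a system concrete, the following is a minimal Python sketch of the decision step. It is illustrative only and is not the ITR prototype’s actual implementation: the UserState fields and the choose_tutoring_action rules are hypothetical placeholders standing in for the real recognizers and tutoring policy.

    # Illustrative sketch only; the fields and rules below are hypothetical
    # placeholders, not the ITR prototype's actual code.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class UserState:
        emotion: str          # e.g., "frustration", "interest" (hypothetical labels)
        engagement: float     # 0.0 (disengaged) to 1.0 (fully engaged)
        keywords: List[str]   # speech keywords recognized so far
        gear_state: str       # state of the gear task, e.g., "in_progress", "solved"

    def choose_tutoring_action(state: UserState) -> str:
        """Map the recognized user state to an avatar tutoring strategy (toy rules)."""
        if state.emotion == "frustration" or state.engagement < 0.3:
            return "offer_hint"          # intervene before the user gives up
        if "help" in state.keywords:
            return "answer_question"     # respond to an explicit request
        if state.gear_state == "solved":
            return "praise_and_advance"  # acknowledge success, raise difficulty
        return "observe"                 # otherwise, stay in the background

    # Example: a frustrated, disengaged user triggers a hint.
    print(choose_tutoring_action(UserState("frustration", 0.2, [], "in_progress")))

Real systems of this kind replace each field with the output of a dedicated recognizer (vision, audio, task sensors) and the rule set with a learned or expert-designed tutoring policy.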
Other examples of affect-sensitive, multimodal HCI systems include the system of Duric et al. [22], which applies a model of embodied cognition that can be seen as a detailed mapping between the user’s affective states and the types of interface adaptations; the proactive HCI tool of Maat and Pantic [51], capable of learning the user’s context-dependent behavioral patterns from multi-sensory data and adapting the interaction accordingly; and the automated Learning Companion of Kapoor et al. [43], which combines information from cameras, a sensing chair and mouse, and a wireless skin sensor to detect frustration and thus predict when the user needs help. These systems offer a glimpse of future multimodal human-computer interaction.

Beyond standard HCI scenarios, potential commercial applications of automatic human affect recognition include affect-sensitive systems for customer service, call centers [46], intelligent automobile systems [40], and the game and entertainment industry. Such systems will change the nature of human-computer interaction in our daily lives. Another important