Multimodal Human Computer Interaction: A Survey

Alejandro Jaimes*,1 and Nicu Sebe&

* IDIAP, Switzerland
ajaimes@ee.columbia.edu
& University of Amsterdam, The Netherlands
nicu@science.uva.nl

Abstract. In this paper we review the major approaches to Multimodal Human Computer Interaction, giving an overview of the field from a computer vision perspective. In particular, we focus on body, gesture, gaze, and affective interaction (facial expression recognition and emotion in audio). We discuss user and task modeling, and multimodal fusion, highlighting challenges, open issues, and emerging applications for Multimodal Human Computer Interaction (MMHCI) research.

1 Introduction

Multimodal Human Computer Interaction (MMHCI) lies at the crossroads of several research areas including computer vision, psychology, artificial intelligence, and many others. We study MMHCI to determine how we can make computer technology more usable by people, which invariably requires the understanding of at least three things: the user who interacts with it, the system (the computer technology and its usability), and the interaction between the user and the system. By considering these aspects, it is obvious that MMHCI is a multi-disciplinary subject since the designer of an interactive system should have expertise in a range of topics: psychology and cognitive science to understand the user’s perceptual, cognitive, and problem solving skills, sociology to understand the wider context of interaction, ergonomics to understand the user’s physical capabilities, graphic design to produce effective interface presentation, computer science and engineering to be able to build the necessary technology, etc.

The multidisciplinary nature of MMHCI motivates our approach to this survey. Instead of focusing only on Computer Vision techniques for MMHCI, we give a general overview of the field, discussing the major approaches and issues in MMHCI from a computer vision perspective.
Our contribution, therefore, is giving researchers in Computer Vision or any other area who are interested in MMHCI a broad view of the state of the art and outlining opportunities and challenges in this exciting area.

1.1. Motivation

In human-human communication, interpreting the mix of audio-visual signals is essential in communicating. Researchers in many fields recognize this, and thanks to

1 This work was performed while Alejandro Jaimes was with FXPAL Japan, Fuji Xerox Co., Ltd.