Generating Robot/Agent Backchannels During a Storytelling Experiment

S. Al Moubayed, M. Baklouti, M. Chetouani, T. Dutoit, A. Mahdhaoui, J.-C. Martin, S. Ondas, C. Pelachaud, J. Urbain, M. Yilmaz

S. Al Moubayed is with the Center for Speech Technology, Royal Institute of Technology (KTH), Sweden, sameram@kth.se
M. Baklouti is with Thales, France, malek.baklouti@thalesgroup.com
M. Chetouani and A. Mahdhaoui are with the University Pierre and Marie Curie, France, mohamed.chetouani@upmc.fr, Ammar.Mahdhaoui@isir.fr
T. Dutoit and J. Urbain are with the Faculté Polytechnique de Mons, Belgium, thierry.dutoit@fpms.ac.be, jerome.urbain@fpms.ac.be
J.-C. Martin is with LIMSI, France, martin@limsi.fr
S. Ondas is with the Technical University of Kosice, Slovakia, stanislav.ondas@gmail.com
C. Pelachaud is with INRIA, France, catherine.pelachaud@inria.fr
M. Yilmaz is with Koc University, Turkey, yilmazmehmetmustafa@gmail.com

Abstract— This work presents a real-time framework for research on the multimodal feedback of robots/talking agents in the context of Human-Robot Interaction (HRI) and Human-Computer Interaction (HCI). To evaluate the framework, a multimodal corpus (eNTERFACE STEAD) was built, and a study of the multimodal features that matter for building an active robot/agent listener in a storytelling experiment with humans was carried out. The experiments show that even when the same reactive behavior models are built for the robot and the talking agent, the communicated behavior is interpreted and realized differently, because robots and agents offer different communicative channels: physical but less human-like for the robot, and virtual but more expressive and human-like for the talking agent.

I. INTRODUCTION

In recent years, several methods have been proposed to improve the interaction between humans and talking agents or robots. The key idea behind their design is to develop agents/robots with various capabilities: establishing/maintaining interaction, showing/perceiving emotions, holding a dialog, displaying communicative gestures and gaze, exhibiting a distinctive personality, or learning/developing social capabilities [1], [2]. These social agents and robots aim at interacting naturally with humans by exploiting these capabilities.

In this paper, we investigate one aspect of this social interaction: engagement in the conversation [3]. The engagement process makes it possible to regulate the interaction between the human and the agent or robot. This process is clearly multimodal (verbal and non-verbal) and requires the involvement of both partners.

This paper deals with two different interaction types, namely Human-Robot Interaction (HRI) with the Sony AIBO robot and Human-Computer Interaction (HCI) with an Embodied Conversational Agent (ECA). The term ECA was coined by Cassell et al. [4] and refers to human-like virtual characters that typically engage in face-to-face communication with the human user. We use GRETA [5], an ECA whose interface follows the SAIBA (Situation, Agent, Intention, Behavior, Animation) architecture [6].

We focused on the design of an open-source, real-time software platform for designing the feedback provided by the robot and the humanoid agent during the interaction 1. The multimodal feedback problem considered here is limited to facial and neck movements for the agent (while the AIBO robot uses all possible body movements, given its poor facial expressivity): we did not address arm or body gestures.

1 The database and the source code for the software developed during the project are available online from the eNTERFACE08 web site: www.enterface.net/enterface08.
This paper is organized as follows. Section II presents the storytelling experiment used for the design of our human-robot/agent interaction system, which is described in Section III. Section IV focuses on the speech and face analysis modules we developed. Sections V and VI then describe the multimodal generation of backchannels, including the interpretation of communicative signals and the reactive behaviors implemented for the agent and the robot. Finally, Section VII presents the evaluation and comparison of our HCI and HRI systems.

II. FACE-TO-FACE STORYTELLING EXPERIMENT

A. Data collection

In order to model the interaction between the speaker and the listener during a storytelling experiment, we first recorded and annotated a database of human-human interactions, termed eNTERFACE STEAD. This database was used to extract feedback rules (Section II-B) and also to test the multimodal feature extraction system (Section IV).

We followed the McNeill lab framework [7]: one participant (the speaker) first watches an animated cartoon (Sylvester and Tweety) and immediately retells the story to a listener. The narration is accompanied by spontaneous communicative signals (filled pauses, gestures, facial expressions...). 22 storytelling sessions were videotaped under different conditions, covering 4 languages (Arabic, French, Turkish and Slovak). The videos were annotated (with at least two annotators per session) to describe simple communicative signals of both speaker and listener: smile, head nod, head shake, eyebrow movement and acoustic prominence.
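To illustrate how such annotations can be handled when deriving feedback rules, the sketch below shows one possible in-memory representation of the annotated signals and a simple query for listener reactions following speaker events. It is a minimal, hypothetical example in Python: the field names, signal labels, timings and the one-second reaction window are our own assumptions for illustration, not the actual eNTERFACE STEAD format or rule-extraction procedure.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical labels for the annotated communicative signals;
    # the actual corpus labels may differ.
    SIGNALS = {"smile", "head_nod", "head_shake", "eyebrow", "acoustic_prominence"}

    @dataclass
    class Annotation:
        role: str        # "speaker" or "listener"
        signal: str      # one of SIGNALS
        start: float     # seconds from session start
        end: float       # seconds from session start
        annotator: str   # annotator id (two annotators per session)

    def listener_reactions(annotations: List[Annotation],
                           window: float = 1.0) -> List[Annotation]:
        """Return listener signals starting within `window` seconds after a
        speaker signal ends -- a naive way to collect candidate
        backchannel-triggering events when looking for feedback rules."""
        speaker_events = [a for a in annotations if a.role == "speaker"]
        listener_events = [a for a in annotations if a.role == "listener"]
        return [l for l in listener_events
                if any(0.0 <= l.start - s.end <= window for s in speaker_events)]

    if __name__ == "__main__":
        session = [
            Annotation("speaker", "acoustic_prominence", 12.3, 12.7, "A1"),
            Annotation("listener", "head_nod", 13.0, 13.4, "A1"),
            Annotation("listener", "smile", 20.0, 21.0, "A1"),
        ]
        for r in listener_reactions(session):
            print(r.signal, r.start)  # -> head_nod 13.0

Counting which listener signals tend to follow which speaker signals in such aligned records is one straightforward way to motivate probabilistic feedback rules of the kind discussed in Section II-B.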