Generating Robot/Agent Backchannels During a Storytelling
Experiment
S. Al Moubayed, M. Baklouti, M. Chetouani, T. Dutoit, A. Mahdhaoui,
J.-C. Martin, S. Ondas, C. Pelachaud, J. Urbain, M. Yilmaz

S. Al Moubayed is with the Center for Speech Technology, Royal Institute of Technology KTH, SWEDEN sameram@kth.se
M. Baklouti is with Thales, FRANCE malek.baklouti@thalesgroup.com
M. Chetouani and A. Mahdhaoui are with the University Pierre and Marie Curie, FRANCE mohamed.chetouani@upmc.fr, Ammar.Mahdhaoui@isir.fr
T. Dutoit and J. Urbain are with the Faculté Polytechnique de Mons, BELGIUM thierry.dutoit@fpms.ac.be, jerome.urbain@fpms.ac.be
J.-C. Martin is with LIMSI, FRANCE martin@limsi.fr
S. Ondas is with the Technical University of Kosice, SLOVAKIA stanislav.ondas@gmail.com
C. Pelachaud is with INRIA, FRANCE catherine.pelachaud@inria.fr
M. Yilmaz is with Koç University, TURKEY yilmazmehmetmustafa@gmail.com
Abstract— This work presents the development of a real-time framework for research on the multimodal feedback of robots and talking agents in the context of Human-Robot Interaction (HRI) and Human-Computer Interaction (HCI). To evaluate the framework, a multimodal corpus (eNTERFACE STEAD) was built, and a study of the relevant multimodal features was carried out in order to build an active robot/agent listener for a storytelling experience with humans. The experiments show that even when the same reactive behavior models are built for the robot and the talking agent, the interpretation and realization of the communicated behavior differ because of the different communicative channels robots and agents offer: physical but less human-like for the robot, and virtual but more expressive and human-like for the talking agent.
I. INTRODUCTION
In recent years, several methods have been proposed to improve the interaction between humans and talking agents or robots. The key idea behind their design is to endow agents and robots with various capabilities: establishing and maintaining interaction, showing and perceiving emotions, engaging in dialog, displaying communicative gestures and gaze, exhibiting a distinctive personality, and learning or developing social capabilities [1], [2]. These social agents and robots aim to interact naturally with humans by exploiting these capabilities. In this paper, we investigate one aspect of this social interaction: engagement in the conversation [3]. The engagement process makes it possible to regulate the interaction between the human and the agent or robot. This process is inherently multimodal (verbal and non-verbal) and requires the involvement of both partners.
This paper deals with two different interaction types, namely Human-Robot Interaction (HRI) with the Sony AIBO robot and Human-Computer Interaction (HCI) with an Embodied Conversational Agent (ECA). The term ECA was coined by Cassell et al. [4] and refers to human-like virtual characters that typically engage in face-to-face communication with the human user. We used GRETA [5], an ECA whose interface follows the SAIBA (Situation, Agent, Intention, Behavior, Animation) architecture [6]. We focused on the design of an open-source, real-time software platform for designing the feedback provided by the robot and the virtual agent during the interaction¹. The multimodal feedback problem considered here was limited to facial and neck movements for the agent (while the AIBO robot uses all possible body movements, given its poor facial expressivity): we did not address arm or body gestures.
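As an illustration of how such feedback can be expressed at the Behavior level of a SAIBA-style pipeline, the sketch below builds a BML-like block describing a head nod and ships it to a behavior realizer. The element names, the UDP transport, and the port are illustrative assumptions, not the exact GRETA interface.

    # Minimal sketch of emitting a BML-like head-nod backchannel to a
    # SAIBA-style behavior realizer. Element names and the UDP transport
    # are illustrative assumptions, not the exact GRETA API.
    import socket
    import itertools

    _ids = itertools.count(1)

    def make_nod_bml(amount: float = 0.5, duration: float = 0.8) -> str:
        """Build a BML-like block describing a single head nod."""
        bml_id = f"bml{next(_ids)}"
        return (
            f'<bml id="{bml_id}">'
            f'<head id="{bml_id}:h1" type="NOD" amount="{amount}" '
            f'start="0.0" end="{duration}"/>'
            f'</bml>'
        )

    def send_to_agent(bml: str, host: str = "localhost", port: int = 4010) -> None:
        """Ship the block to the realizer; host and port are placeholders."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(bml.encode("utf-8"), (host, port))

    if __name__ == "__main__":
        print(make_nod_bml())  # inspect the generated block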
This paper is organized as follows. In section II, we present the storytelling experiment used to design our human-robot/agent interaction system, which is described in section III. Section IV focuses on the speech and face analysis modules we developed. Sections V and VI then describe the multimodal generation of backchannels, including the interpretation of communicative signals and the reactive behaviors implemented for the agent and the robot. Finally, section VII presents the evaluation and comparison of our HCI and HRI systems.
II. FACE-TO-FACE STORYTELLING EXPERIMENT
A. Data collection
In order to model the interaction between the speaker and the listener during a storytelling experiment, we first recorded and annotated a database of human-human interactions, termed eNTERFACE STEAD. This database was used for extracting feedback rules (section II-B) but also for testing the multimodal feature extraction system (section IV).

We followed the McNeill lab framework [7]: one participant (the speaker) watches an animated cartoon (Sylvester and Tweety) and immediately retells the story to a listener. The narration is accompanied by spontaneous communicative signals (filled pauses, gestures, facial expressions, etc.). 22 storytelling sessions were videotaped under different conditions, covering four languages (Arabic, French, Turkish and Slovak). The videos were annotated (with at least two annotators per session) to describe simple communicative signals of both speaker and listener: smiles, head nods, head shakes, eyebrow movements and acoustic prominence.
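As a hint of how feedback rules can be derived from such annotations (section II-B), the sketch below estimates, for each annotated speaker signal, how often a listener backchannel starts within a short window after it. The data layout and the one-second window are illustrative assumptions, not the exact procedure used in the project.

    # Sketch of a co-occurrence analysis over the annotation tracks: for
    # each speaker signal, how often does a listener backchannel start
    # within a short window afterwards? Layout and window are assumptions.
    from bisect import bisect_left
    from collections import defaultdict

    def feedback_rates(speaker_events, listener_onsets, window=1.0):
        """speaker_events: list of (time_sec, label) pairs.
        listener_onsets: sorted list of backchannel onset times (sec)."""
        hits = defaultdict(int)
        totals = defaultdict(int)
        for t, label in speaker_events:
            totals[label] += 1
            i = bisect_left(listener_onsets, t)
            if i < len(listener_onsets) and listener_onsets[i] <= t + window:
                hits[label] += 1
        return {lab: hits[lab] / totals[lab] for lab in totals}

    # Toy usage with hand-made annotations:
    speaker = [(1.2, "prominence"), (3.5, "smile"), (6.0, "prominence")]
    listener = [1.6, 6.4]  # backchannel onset times, sorted
    print(feedback_rates(speaker, listener))  # prominence -> 1.0, smile -> 0.0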
¹ The database and the source code for the software developed during the project are available online from the eNTERFACE08 web site: www.enterface.net/enterface08.