Algorithms for Controlling Cooperation between Output Modalities in 2D Embodied Conversational Agents

Sarkis Abrilian*, Jean-Claude Martin*† and Stéphanie Buisine*
* LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France, +33.1.69.85.81.04
† LINC-Univ Paris 8, IUT de Montreuil, 140 rue de la Nouvelle France, 93100 Montreuil, France
{sarkis,martin,buisine}@limsi.fr

ABSTRACT
Recent advances in the specification of the multimodal behavior of Embodied Conversational Agents (ECA) have proposed a direct, deterministic, one-step mapping from high-level specifications of dialog state or agent emotion onto low-level specifications of the multimodal behavior to be displayed by the agent (e.g. facial expression, gestures, vocal utterance). The difference in abstraction between these two levels of specification makes such a complex mapping difficult to define. In this paper we propose an intermediate level of specification based on combinations between modalities (e.g. redundancy, complementarity). We explain how such intermediate-level specifications can be described in XML in the case of deictic expressions. We define algorithms for parsing these descriptions and generating the corresponding multimodal behavior of 2D cartoon-like conversational agents. Some random selection has been introduced in these algorithms in order to induce "natural variations" in the agent's behavior. We conclude on the usefulness of this approach for the design of ECA.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces – interaction styles, standardization, ergonomics, user interface management systems; H.5.1 Multimedia Information Systems.

General Terms
Algorithms, Human Factors, Languages.

Keywords
Multimodal output, Embodied Conversational Agent, Specification, Redundancy.

1. INTRODUCTION
Among multimodal output interfaces, Embodied Conversational Agents (ECA) seem promising for bringing intuitiveness and richness to human-computer interaction. Advances in the specification of the multimodal behavior of ECA have mostly proposed direct one-step mappings from high-level specifications of dialog state or agent emotion onto low-level specifications of the multimodal behavior to be displayed by the agent (e.g. facial expression, gestures, vocal utterance). For example, the SAFIRA project [1] combines a top-down approach via the Character Mark-up Language (from personality, emotion and behavior to animation) with a bottom-up approach via the Avatar Mark-up Language (selection and synchronized merging of animations). The NECA system [6] generates the interaction between two or more characters in a number of steps, with the information flow proceeding from a Scene Generator to a Multi-modal Natural Language Generator, to a Speech Synthesis component, to a Gesture Assignment component, and finally to a media player; a representation language was thus defined as a means of representing the various kinds of expert knowledge required at the interfaces between these components. Other XML-based specification languages for ECA include VHML [4], MPML [7] and APML [3]. All these languages propose mappings between a rather "high" level of abstraction and a "low" level of abstraction (e.g. translating a "happy" tag into corresponding animations of facial expressions and prosodic parameters for speech synthesis).
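To make this gap concrete, the following minimal Java sketch illustrates the kind of direct, deterministic one-step mapping such languages perform, from a high-level tag to low-level animation and prosody commands. All tag and command names here are hypothetical; the sketch does not reproduce the actual format or API of any of the systems cited above.

import java.util.List;
import java.util.Map;

// Generic illustration of a one-step high-to-low mapping (hypothetical names).
public class DirectMappingSketch {

    // A fixed, deterministic lookup table: the same high-level tag
    // always expands into exactly the same low-level commands.
    private static final Map<String, List<String>> MAPPING = Map.of(
        "happy", List.of("face:smile", "eyebrows:raised",
                         "prosody:pitch+20%", "prosody:rate+10%"),
        "sad",   List.of("face:frown", "gaze:down",
                         "prosody:pitch-15%", "prosody:rate-10%"));

    public static List<String> lowLevelBehavior(String highLevelTag) {
        return MAPPING.getOrDefault(highLevelTag, List.of());
    }

    public static void main(String[] args) {
        // Expanding the high-level tag "happy" into low-level commands.
        System.out.println(lowLevelBehavior("happy"));
    }
}

Because the table is fixed, everything between the high-level tag and the low-level commands has to be encoded in a single mapping step.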
The difference in abstraction between these two levels of specification makes such a complex mapping difficult to define, and this mapping is a key issue in the design of "believable" ECA. One potential dimension of cooperation between modalities which is not considered in such specification languages is the degree of redundancy vs. complementarity between the signals conveyed by several modalities for rendering different emotional states or communicative act strengths. Moreover, in most systems this mapping is deterministic: the agent always reacts in exactly the same way to a given situation. Such behavior might appear consistent, but it is quite unnatural when compared to the complexity of human communication and reactions.

Section 2 describes the 2D agent technology we use. In Section 3, we define algorithms for parsing such "intermediate level" descriptions and generating the corresponding multimodal behavior of 2D cartoon-like conversational agents in the case of classical referring expressions. Some random selection has been introduced in these algorithms in order to induce "natural variations" in the agent's behavior.

2. LOW-LEVEL SPECIFICATION
We use 2D cartoon-like agents developed in Java. A catalogue of images representing several configurations of each body part has been designed. We present below a low-level specification of a