Multimodal Meaning Representation for Generic Dialogue Systems Architectures

Frédéric Landragin, Alexandre Denis, Annalisa Ricci, Laurent Romary
LORIA – UMR 7503
Campus scientifique, B.P. 239
F-54506 Vandoeuvre-lès-Nancy Cedex
{landragi, denis, ricci, romary}@loria.fr

Abstract

A unified language for the communicative acts between agents is essential for the design of multi-agent architectures. Whatever the type of interaction (linguistic, multimodal, including particular aspects such as force feedback) and whatever the type of application (command dialogue, request dialogue, database querying), the concepts are common, and a generic meta-model is needed. In order to move towards task-independent systems, the procedures for parameterizing the modules must be clarified. In this paper, we focus on the characteristics of a meta-model designed to represent meaning in linguistic and multimodal applications. This meta-model, called MMIL (MultiModal Interface Language), was first specified in the framework of the IST MIAMM European project. What we want to test here is how relevant MMIL is in a completely different context (a different task, a different interaction type, a different linguistic domain). We detail the exploitation of MMIL in the framework of the IST OZONE European project, and we draw conclusions on the role of MMIL in the parameterization of task-independent dialogue managers.

Introduction

The specification of a language that represents both the form and the content of linguistic resources is an important task in the design of dialogue system architectures. The more spontaneous and constraint-free the natural language dialogue is, the more complex the form and the content of these resources become.
When the dialogue is multimodal, that is, when a gesture capture device complements the microphone through which the user interacts with the system, the language must combine the capability of handling complex structures in the language resources with the generality and the flexibility required for operating as a communication interface between the various modules. In a multimodal system, the main, classical modules are the following: speech recognizer, gesture recognizer, semantic analyzer, multimodal fusion, action planner, multimodal fission, speech synthesizer, and visual feedback producer. The use of a representation language common to all communicative acts offers several advantages in terms of generality and parameterization. For instance, exchanges between all the previously mentioned modules are represented using the same format and the same content description, and the particular application for which the system is instantiated parameterizes the action planner using the same type of resource.

In this paper, we present our experience in designing the MMIL (MultiModal Interface Language) language in the framework of the IST MIAMM European project (see also Kumar & Romary, 2002), and we describe the procedure followed while re-using it for the IST OZONE European project. In particular, we describe the MMIL specifications for the two demonstrators that were implemented during these projects, and the adaptations required for the management of new features. Among these new features: the management of salience, the status of secondary events in a user utterance, and the status of speech acts. On the basis of this experience, we draw some conclusions on the design of application-independent dialogue systems, and we discuss how MMIL could be the object of a future standardization effort for the representation of multimodal semantic content.
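The benefit of a common interchange format between all modules can be sketched in miniature. The following Python fragment is purely illustrative and is not part of the MMIL specification: the class and field names (`InterfaceEvent`, `modality`, `confidence`, etc.) are invented for this example. It shows how a speech recognizer and a gesture recognizer can emit events in the same structure, which the multimodal fusion module then merges without any modality-specific protocol.

```python
from dataclasses import dataclass, field

# Hypothetical shared message structure for all modules (recognizers,
# fusion, action planner, ...); field names are illustrative, not
# taken from the MMIL specification.
@dataclass
class InterfaceEvent:
    modality: str              # e.g. "speech", "gesture"
    event_type: str            # e.g. "request", "pointing"
    content: dict = field(default_factory=dict)
    confidence: float = 1.0

def fuse(speech: InterfaceEvent, gesture: InterfaceEvent) -> InterfaceEvent:
    """Sketch of multimodal fusion: merge a spoken request with the
    referent resolved from a pointing gesture."""
    merged = dict(speech.content)
    merged["referent"] = gesture.content.get("target")
    return InterfaceEvent(
        modality="multimodal",
        event_type=speech.event_type,
        content=merged,
        confidence=min(speech.confidence, gesture.confidence),
    )

# "Play this" + pointing at an object on screen.
speech = InterfaceEvent("speech", "request",
                        {"predicate": "play", "object": "this"}, 0.9)
gesture = InterfaceEvent("gesture", "pointing", {"target": "track_3"}, 0.8)
fused = fuse(speech, gesture)
```

Because every module consumes and produces the same structure, swapping the action planner's resources for a new application does not require touching the interchange layer, which is precisely the parameterization advantage discussed above.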
MMIL specifications in MIAMM

The specification of MMIL in MIAMM covers the definition of a language able to uniformly represent semantic content in a multimodal context, so as to capture:
- linguistic, gestural, and graphical events,
- both dialogue acts and the contents of dialogue acts.

Past or existing initiatives are often limited to specific modalities, while the aim of MMIL is to provide a meta-model for semantic representation free from any modality constraint.

MMIL compared to other languages

M3L (Multimodal Markup Language) was specified for the SmartKom project for the representation of the information that flows between the various processing components (speech recognition, gesture recognition, face interpretation, media fusion, presentation planning, etc.). In particular, M3L represents all information about segmentation, synchronization, and confidence in processing results. Its main strength lies in its large coverage. But, contrary to MMIL, there is no meta-model behind M3L, and its XML syntax is not as flexible.

EMMA (Extensible MultiModal Annotation Markup Language) aims to represent information automatically extracted from a user's input by an interpretation component. It is a technical report from the W3C Multimodal Interaction working group, whose main purpose is to develop specifications enabling access to the Web through multimodal interaction. This orientation makes the language very engineering-oriented.

MPML (Multimodal Presentation Markup Language) was, and is still being, designed for multimodal presentation using interactive life-like agents. Its purpose is to provide a means to write attractive presentations easily (Tsutsui et al., 2000). It illustrates the interest of description language
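To make the two requirements above more concrete (events on one hand, dialogue acts and their contents on the other), the following sketch builds an MMIL-like XML structure for a multimodal utterance. All element and attribute names here (`mmilComponent`, `event`, `participant`, `relation`, etc.) are invented for illustration and do not reproduce the actual MMIL schema; only the overall shape, i.e. a dialogue act event linked by relations to a content event and its participants, reflects the idea described in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical MMIL-like representation of "play this" + a pointing
# gesture. Element names are illustrative guesses, not the actual
# MMIL vocabulary.
root = ET.Element("mmilComponent")

# The dialogue act itself is an event...
speak = ET.SubElement(root, "event", id="e0")
ET.SubElement(speak, "evtType").text = "speak"
ET.SubElement(speak, "dialogueAct").text = "request"

# ...and so is its content (the requested action).
play = ET.SubElement(root, "event", id="e1")
ET.SubElement(play, "evtType").text = "play"

# The demonstrative "this", resolved through the pointing gesture.
target = ET.SubElement(root, "participant", id="p0")
ET.SubElement(target, "refType").text = "demonstrative"

# Relations link the dialogue act to its content, and the action
# to its object.
ET.SubElement(root, "relation", source="e0", target="e1", type="content")
ET.SubElement(root, "relation", source="e1", target="p0", type="object")

xml_string = ET.tostring(root, encoding="unicode")
```

Representing the dialogue act and its content as events of the same kind, tied together by relations, is what makes such a structure modality-independent: a gestural or graphical event slots into the same graph as a linguistic one.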