MULTIMODAL HIGH LEVEL FUSION OF INPUT COMMANDS AS A SEMANTIC GOAL-ORIENTED COOPERATIVE PROCESS

Kosta Gaitanis, Olga Vybornova, Monica Gemo, Benoit Macq
Université catholique de Louvain
Communications and Remote Sensing Department
Place du Levant 2, Louvain-la-Neuve, Belgium

This work is supported by the SIMILAR network of excellence of the European Sixth Framework Programme (FP6-2002-IST1-507609).

ABSTRACT

This paper presents a generic non-deterministic approach for the integration of multimodal semantically grounded input commands, which we call multimodal high level fusion. The final goal is to develop a reference fusion framework of multiple modalities for enhanced natural interaction. Key to this approach is the representation of each modality as an agent participating in a global team process whose goal is the effective and natural execution of interaction. To make the approach flexible and widely applicable, the modalities and the fusion process are modelled in an ontology expressing the domain constraints. Such ontologies provide contextual information to the Bayesian networks used to analyse and combine the modalities of interest. Here we show that Bayesian intention planning is a sound, elegant, yet practical way to handle the fusion process, and we apply it to the correlation of speech and pointing gestures in the case of object manipulation.

1. INTRODUCTION

Traditional interfaces with standard modalities might not always be accessible to everybody (e.g. disabled/impaired users). More natural and expressive computer control can be achieved through multimodal interaction with voice and gesture commands that better match the habitual communication skills of human beings. The central point in multimodal interaction is the modality fusion component, which integrates mono-modal interpretations of multiple modalities into a single semantic representation of the user's actions. An increasingly wide range of atomic and combined modalities is becoming available thanks to advances in natural language processing, computer vision, 3-D sound and gesture recognition, as well as to new interaction paradigms such as perceptual user interfaces (UIs), tangible UIs and embodied UIs. Although many multimodal fusion systems have been built, their development is mainly restricted to specific cases (e.g. [1], [2]) and they are not adequate to the increasing complexity of the new modalities (e.g. [3]). More specifically, the fusion component must support the following functionalities: 1) assignment of a role and a weight to each modality; 2) coordination of the modalities; 3) analysis of single modalities at the required representational levels (i.e. articulatory, syntactic and semantic levels). In this work we present a generic non-deterministic approach that aims to address all of the above points in a flexible and extensible way. Modality fusion is defined by a probabilistic M-AHMEM (MultiAgent Abstract Hidden Markov mEmory Model) goal-oriented cooperative plan for the natural execution of interaction that can be decomposed at different levels of abstraction. Each intervening modality acts as an agent of the global team process and is described in an ontology expressing the domain constraints. Such ontologies provide contextual information to the Bayesian networks used to analyse and combine the modalities of interest. We first outline the relevant multimodal fusion systems, before illustrating our proposed model in Section 3 and the way we apply it to object manipulation controlled by speech and gesture input commands in Section 4. The final section presents concluding remarks.
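As a minimal sketch of this idea, each modality agent can be viewed as reporting a likelihood over candidate user intentions, which is then combined with an ontology-derived context prior via Bayes' rule. The intention names, probabilities and the fuse_intentions helper below are purely illustrative assumptions and stand in for the full M-AHMEM machinery described in Section 3.

# Minimal sketch, not the paper's M-AHMEM: a single naive-Bayes fusion step.
# The "prior" dict stands in for ontology-derived domain constraints; each
# modality agent contributes a likelihood over candidate intentions.

def fuse_intentions(prior, modality_likelihoods):
    """Combine a context prior with per-modality likelihoods over intentions."""
    posterior = {}
    for intention, p in prior.items():
        for likelihood in modality_likelihoods:
            p *= likelihood.get(intention, 1e-6)  # small floor for unseen intentions
        posterior[intention] = p
    total = sum(posterior.values()) or 1.0
    return {i: p / total for i, p in posterior.items()}

if __name__ == "__main__":
    # Context prior, e.g. derived from the domain ontology (illustrative values).
    prior = {"move(cup)": 0.4, "move(book)": 0.4, "delete(cup)": 0.2}
    # Speech agent: the utterance "put that over there" favours a move action.
    speech = {"move(cup)": 0.45, "move(book)": 0.45, "delete(cup)": 0.10}
    # Pointing agent: the gesture falls closest to the cup.
    pointing = {"move(cup)": 0.7, "move(book)": 0.2, "delete(cup)": 0.1}

    posterior = fuse_intentions(prior, [speech, pointing])
    best = max(posterior, key=posterior.get)
    print(posterior, "->", best)  # move(cup) dominates

In the actual framework the evidence is combined over time through the hierarchical M-AHMEM policies rather than in a single step, but the underlying principle of weighting per-modality evidence against ontology-provided context is the same.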
2. RELATED WORK

Many different systems for multimodal fusion are known in the literature, and most of them rely on a deterministic rule-based mechanism to perform modality fusion [3, 2, 4, 1]. To achieve the necessary robustness of multimodal input interaction, they limit their scope to restricted domains and to the modalities employed. The major examples concern multimodal dialogue systems. These can deal with the combined usage of speech and 3D gestures [2], conversational interpretation of merged speech and gesture input [1], and context integration in single and combined speech and pen-input modalities [4]. The ICARE environment [3] for multimodal interaction design, based on the complementarity, assignment, redundancy and equivalence conceptual properties of modalities, is a step towards more generic modality composition; however, its fusion mechanism is also deterministic and rule-based, and it is not easily applicable to the new modalities (perceptual, tangible and embodied interaction).