Semi-Automatically Generated High-Level Fusion for Multimodal User Interfaces

Dominik Ertl, Sevan Kavaldjian, Hermann Kaindl, Jürgen Falb
Vienna University of Technology, Institute of Computer Technology
A–1040 Vienna, Austria
{ertl, kavaldjian, kaindl, falb}@ict.tuwien.ac.at

Abstract

Reliable high-level fusion of several input modalities is hard to achieve, and (semi-)automatically generating it is even more difficult. It is important to address, however, in order to broaden the scope of providing user interfaces semi-automatically. Our approach starts from a high-level discourse model created by a human interaction designer. Since this model is modality-independent, an annotated discourse is semi-automatically generated from it, which influences the fusion mechanism. Our high-level fusion checks hypotheses from the various input modalities by use of finite state machines. These are modality-independent, and they are automatically generated from the given discourse model. Taken together, our approach provides semi-automatic generation of high-level fusion. It currently supports the input modalities graphical user interface, (simple) speech, a few hand gestures, and a bar code reader.

1 Introduction

While semi-automatic generation of user interfaces is still primarily a matter of research, the approaches are becoming increasingly mature. Our own approach is based on high-level discourse modeling, and its focus has been on graphical user interfaces (GUIs), more precisely WIMP (window, icon, menu, pointer) interfaces [4, 9, 15]. More recently, we extended this approach to multimodal interfaces. Managing the input from several modalities requires high-level fusion. This is a challenging issue on its own, and semi-automatic generation makes it even harder. Still, we can generate finite state machines (FSMs) — for checking input hypotheses from the various modalities — from our discourse models.
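To illustrate, such an FSM can be thought of as accepting input hypotheses from several modalities and emitting a communicative act once a terminal state is reached. The following minimal Python sketch is our own illustration; the `FusionFSM` class, the state names, and the hypothesis labels are all invented for this example (in the approach itself, such machines are generated automatically from the discourse model, not hand-written):

```python
# Illustrative sketch only: states, transitions, and names are invented;
# the actual FSMs are generated from discourse models.

class FusionFSM:
    """Checks input hypotheses from several modalities and emits a
    communicative act when an accepting state is reached."""

    def __init__(self, transitions, start, accepting, act):
        self.transitions = transitions  # {(state, hypothesis): next_state}
        self.state = start
        self.accepting = accepting      # set of accepting states
        self.act = act                  # communicative act emitted on acceptance

    def feed(self, hypothesis):
        """Advance on one hypothesis; return a communicative act or None."""
        key = (self.state, hypothesis)
        if key not in self.transitions:
            return None                 # hypothesis rejected in this state
        self.state = self.transitions[key]
        return self.act if self.state in self.accepting else None

# Hypothetical FSM for a shopping-cart "select product" interaction:
# the user may point at an item (gesture) and confirm by speech,
# or scan its bar code directly.
fsm = FusionFSM(
    transitions={
        ("idle", "gesture:point"): "pointed",
        ("pointed", "speech:this_one"): "selected",
        ("idle", "barcode:scan"): "selected",
    },
    start="idle",
    accepting={"selected"},
    act="CA:SelectProduct",
)

print(fsm.feed("gesture:point"))    # None: still waiting for confirmation
print(fsm.feed("speech:this_one"))  # emits the act "CA:SelectProduct"
```

Note that two different modality combinations (gesture plus speech, or bar code alone) lead to the same communicative act, which is the abstraction the fusion mechanism relies on.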
The results of these checks are communicative acts that provide an abstract representation of what is believed to be the input.

Our overall generation process from high-level discourses to multimodal user interfaces is a further development of the one presented in [3], including the new generation of FSMs (see Figure 1). First, the modality-independent discourse is modeled by an interaction designer. Then, the FSMs are generated, and a task-level transformation leads to an annotated discourse. Finally, the annotated discourse is rendered into the final modalities.

As a running example, we use a small part of a multimodal user interface of a research robot shopping cart, which is a non-trivial application of our approach. We use this example to show the semi-automatic generation of the fusion mechanism and its runtime behavior.

The remainder of this paper is organized as follows. First, we provide some background on high-level fusion for multimodal user interfaces and on our approach to interaction design, in order to make this paper self-contained. Then we explain the concept of a modality provider and sketch how the modalities that we currently use for input are fed into the fusion mechanism. After that, we present our annotated discourse model and its semi-automatic generation. Based on that, the core of the paper is dedicated to explaining our new approach to semi-automatic generation of fusion based on finite state machines. Finally, we discuss the fulfillment of the CARE properties [2] by our approach.

2 State of the Art and Background

In this section, we sketch some background information about high-level fusion mechanisms for multimodal user interfaces and about our interaction design approach.

2.1 High-level Fusion for Multimodal User Interfaces

A multimodal user interface offers more than one modality (e.g., speech and GUI) to the user.
In order to process

Proceedings of the 43rd Hawaii International Conference on System Sciences - 2010
978-0-7695-3869-3/10 $26.00 © 2010 IEEE