How do non-expert users exploit simultaneous inputs in multimodal interaction?

Knut Kvale, John Rugelbak and Ingunn Amdal 1
Telenor R&D, Norway
knut.kvale@telenor.com, john.rugelbak@telenor.com, ingunn.amdal@tele.ntnu.no

Abstract

This paper evaluates scenario-based user tests of speech-centric multimodal interaction on a small mobile terminal. Non-expert users solved tasks in the tourist guide domain using a functional multimodal PDA-based application. The tasks required pen and speech input, but the users were free to choose either sequential or simultaneous pen and speech input at each step in the dialogue. Multimodal interfaces are still a novelty to most users, so we had to explain this functionality to the test users. The format of the introduction had a noticeable effect on user behaviour: users who had seen a video demonstration used simultaneous pen and speech input more often than users who had received a text-only introduction, even though the same information was present in both formats. Nine of the 14 subjects who had seen the video demonstration applied simultaneous pen and speech input instantly. We therefore claim that people will use simultaneous multimodal input when they have been properly introduced to this functionality. However, simultaneous use of pen and speech may impose an extra cognitive load, at least until people become familiar with this kind of interface. The users considered the multimodal interaction attractive and said that they enjoyed the freedom of choosing the input mode at each step in the dialogue.

Key words: Multimodal interaction, small mobile terminals, user behaviour

1. Introduction

Multimodal human-computer interfaces give the user the opportunity to choose the most natural interaction pattern depending on context and application. Multiple input and output modalities can be combined in several ways. Here we distinguish between combining the multimodal inputs sequentially or simultaneously.
Systems allowing either mode have several parallel input channels active at the same time. In a sequential multimodal system only one of the input channels is interpreted (e.g. the first input at each dialogue stage). In a simultaneous multimodal system all inputs within a given time window are interpreted jointly, based on the fusion of the partial information from the different input channels.

In interaction between humans, simultaneous input is natural, but it is by far the most complicated scenario to implement for human-computer interaction, especially on mobile terminals. Before large investments are made in implementing simultaneous pen and speech functionality in future mobile terminals and networks, we need to study how people actually interact with a mobile terminal. In this paper we report experiments on a system that allows simultaneous multimodal input, to study how users exploit simultaneous pen and speech input.

1 Ingunn Amdal is currently at The Norwegian University of Science and Technology
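The sequential versus simultaneous distinction described above can be illustrated with a minimal sketch. This is not the paper's implementation; the event structure, channel names ("pen", "speech") and the two-second fusion window are illustrative assumptions. A sequential interpreter keeps only the first input at a dialogue stage, whereas a simultaneous interpreter fuses all inputs that fall within a time window of the earliest one:

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    channel: str      # input channel, e.g. "pen" or "speech" (illustrative names)
    content: str      # recognized content, e.g. a map position or a spoken phrase
    timestamp: float  # seconds since the current dialogue step started

def interpret_sequential(events):
    """Sequential multimodal interpretation: only the first input
    at each dialogue stage is used; later inputs are ignored."""
    if not events:
        return None
    first = min(events, key=lambda e: e.timestamp)
    return [(first.channel, first.content)]

def interpret_simultaneous(events, window=2.0):
    """Simultaneous multimodal interpretation: all inputs arriving
    within `window` seconds of the earliest event are fused jointly.
    The window length is an assumed parameter, not taken from the paper."""
    if not events:
        return None
    start = min(e.timestamp for e in events)
    fused = [e for e in events if e.timestamp - start <= window]
    return [(e.channel, e.content)
            for e in sorted(fused, key=lambda e: e.timestamp)]
```

For example, a pen tap at 0.1 s combined with the utterance "show me this" at 0.8 s would be fused by the simultaneous interpreter, while the sequential interpreter would act on the pen tap alone.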