Simulating recognition errors in speech user interface prototyping

Matthias Peissner, Frank Heidmann, Jürgen Ziegler
Fraunhofer-Institute for Industrial Engineering (IAO), Nobelstr. 12, D-70569 Stuttgart, Germany

ABSTRACT

We have developed a Wizard of Oz simulation tool that allows scenario-based simulation of speech systems for conducting empirical studies with future users. This paper focuses on the adequate integration of recognition errors, as they are an important feature of speech-based applications. The presented solution considers the aspects of reliability and validity. Both are necessary preconditions for the immediate transferability of simulation results to the real system.

1. SPEECH USER INTERFACE PROTOTYPING

In the field of GUI design it has become common practice to test usability in early development stages. By using paper prototypes, important design decisions can be made on the empirical basis of tests with future users. Compared with the vast number of empirical studies and guidelines concerning the usability of GUIs, we know very little about how to design effective speech user interfaces (SUI). Moreover, SUI designers face the essential difficulty of getting a sound feeling for the dialogue flow by merely inspecting a written dialogue specification. For these reasons it is even more important to include prototyping and usability testing early in the design process of user-friendly interactive voice response systems (IVR systems). The speech equivalent of a paper prototype is a Wizard of Oz (WOZ) study (Weinschenk & Barker, 2000), in which a human (the wizard) simulates the role of the computer during testing and starts different recorded system prompts depending on what the user said.
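The wizard's role described above can be illustrated with a minimal sketch. It is not the tool presented in this paper: the class name `WizardConsole`, the reaction keys and the audio file names are hypothetical and chosen only for illustration. The sketch shows the basic mechanism, in which the wizard listens to the user, picks a reaction, and the console returns the matching recorded prompt while logging the turn for later transcription.

```python
from dataclasses import dataclass, field


@dataclass
class WizardConsole:
    """Maps the wizard's choices to pre-recorded system prompts.

    All names here are hypothetical; a real tool would also play the audio.
    """
    prompts: dict[str, str]  # reaction key -> recorded prompt file
    log: list[tuple[str, str, str]] = field(default_factory=list)

    def react(self, user_utterance: str, reaction: str) -> str:
        """Log the user's utterance and return the prompt the wizard started."""
        if reaction not in self.prompts:
            raise KeyError(f"no recorded prompt for reaction {reaction!r}")
        prompt_file = self.prompts[reaction]
        # Keep a record of each turn so the session can be transcribed later.
        self.log.append((user_utterance, reaction, prompt_file))
        return prompt_file


# Hypothetical prompt inventory for a small banking dialogue.
console = WizardConsole({
    "main_menu": "main_menu.wav",
    "balance": "balance_menu.wav",
})
started = console.react("I want to check my balance", "balance")
```

The log kept by such a console is also what would later feed the transcription and analysis of the test sessions.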
Usability testing with the WOZ technique can lead to valuable results regarding the following topics:

• Designing a user-oriented grammar: In very early development stages, WOZ studies can pinpoint the utterances that users typically employ to control the available functions. Given a sufficient number of subjects, the transcriptions of the test sessions can give a representative picture of what users expect the system to understand. The most frequently recorded utterances can serve as a valid basis for a user-centred grammar. In this way, the time-consuming procedure of pilot testing, including iterative grammar modifications and recognition tuning, can be shortened or even partially avoided (Pearl, 2000).

• Comparison of different systems / system versions: Alternative design decisions can quickly be acted out and tested with future users. In particular, the effects of alternative prompt versions on the users’ performance and attitudes towards the system can be evaluated.

• Overall ergonomic evaluation: WOZ experiments can take the traditional role of usability tests in evaluation and troubleshooting. Detecting major problems of use at an early development stage enables iterative redesign and reconception without the implementation phases that would otherwise be necessary.

A necessary precondition for the validity of a WOZ study is that the interaction between user and “machine” (here the wizard) is as realistic as possible. Otherwise, the results cannot be transferred immediately to the real situation of system use. This means that, on the one hand, the subject in a WOZ study must actually believe that she is interacting with a real system, which is a matter of adequate instruction. On the other hand, the simulation must not differ from the specified system behaviour in essential aspects.
Among others, this refers to the reliability of speech recognition, which is treated in detail in the following section, and to the available complexity of the dialogue. With high-complexity applications it is necessary to do scenario-based testing in order to reduce the number of probable user utterances. This supports the wizard’s decision by providing a situation-specific pre-selection of probable options for “system” reactions.

2. PROBLEM

Speech technology is probabilistic in nature, and recognition errors are therefore inherent in any speech-based application. Furthermore, situations involving recognition errors are especially crucial to usability variables such as effectiveness and efficiency in task solving and user acceptance (Yankelovich, Levow & Marx, 1995).
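To make the probabilistic nature of recognition concrete, the following sketch shows one simple way a simulation could inject errors into a wizard-driven dialogue. It is an illustration under stated assumptions, not the solution presented in this paper: the function name `simulate_recognition`, the fixed `error_rate`, the even split between rejections and substitutions, and the reaction keys are all hypothetical. A seeded random generator is used so that the same error sequence could be replayed for every subject.

```python
import random


def simulate_recognition(intended, options, error_rate, rng):
    """Return (outcome, reaction) for one user turn.

    intended:   reaction the wizard judged to be correct
    options:    scenario-specific pre-selection of possible reactions
    error_rate: assumed probability that a turn is misrecognized
    rng:        random.Random instance (seeded for repeatable sessions)
    """
    if rng.random() >= error_rate:
        return ("correct", intended)
    alternatives = [o for o in options if o != intended]
    # Assumed error model: half rejections, half substitutions.
    if not alternatives or rng.random() < 0.5:
        return ("rejection", "not_understood")
    return ("substitution", rng.choice(alternatives))


# Hypothetical pre-selected reactions for one scenario step.
options = ["balance", "transfer", "main_menu"]
rng = random.Random(42)
turns = [simulate_recognition("balance", options, 0.2, rng) for _ in range(10)]
```

Because the generator is seeded, creating a new `random.Random(42)` and repeating the calls yields the same sequence of outcomes, which is one way to keep the simulated error behaviour comparable across test sessions.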