THE USE OF PROSODIC FEATURES TO HELP USERS EXTRACT INFORMATION FROM STRUCTURED ELEMENTS IN SPOKEN DIALOGUE SYSTEMS

Jaakko Hakulinen, Markku Turunen, and Kari-Jouko Räihä
Human-Computer Interaction Group, Department of Computer Science, University of Tampere, P.O. Box 607, FIN-33101 Tampere, Finland.
Tel. +358 3 2156952, FAX +358 3 2158557, E-mail: {jh, mturunen, kjr}@cs.uta.fi

ABSTRACT

Most of the previous research on speech user interfaces has focused on what information should be presented to the user. Equally important is the question of how this information should be presented. Although speech synthesis is quite intelligible for well-formed, simple sentences, it can be very difficult to understand when complex structural elements, such as tables or URLs, are spoken. We arranged a controlled experiment to identify the prosodic features that affect the intelligibility and pleasantness of synthetic speech. Pauses were found to make a significant difference in comprehension. Good variation in pitch and rate seems to make a voice more pleasant to listen to, but has only a minor positive effect on comprehension. We analyzed the exact ways in which human readers used prosodic elements so that we could construct unique and human-like computer ‘persons’ for spoken dialogue applications.

1. INTRODUCTION

Speech output is widely used in many computer applications. Telephone applications, mainly interactive voice response systems (IVRs), have been very successful. These applications mainly use real speech recorded by professional speakers. However, it is not always possible to use prerecorded speech. An alternative is to use synthetic speech. However, it is often argued that although current speech synthesizers can produce very understandable sentences, most people do not like the way in which those sentences are expressed. Indeed, synthetic speech sounds very monotonous compared to normal human speech.
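One way such monotony is addressed in practice is by driving the synthesizer's prosodic parameters (pitch, rate, volume, pauses) through markup. The following is a minimal sketch only, assuming a synthesizer that accepts W3C SSML markup; SSML is not the mechanism used in this work, and the function name and attribute values are our own illustrative examples:

```python
def ssml_prosody(text, pitch=None, rate=None, volume=None, pause_ms=None):
    """Wrap text in SSML <prosody> markup and optionally append a pause.

    Illustrative only: SSML is one concrete form of synthesizer
    'control codes'; the actual system described here does not use it.
    """
    attrs = " ".join(
        f'{name}="{value}"'
        for name, value in (("pitch", pitch), ("rate", rate), ("volume", volume))
        if value is not None
    )
    out = f"<prosody {attrs}>{text}</prosody>" if attrs else text
    if pause_ms is not None:
        # A <break> element after a structural label gives the listener
        # time to prepare for the value that follows it.
        out += f'<break time="{pause_ms}ms"/>'
    return out

# Emphasize a header field label, then pause before the value is spoken.
print(ssml_prosody("Subject:", pitch="+15%", rate="slow", pause_ms=400))
# → <prosody pitch="+15%" rate="slow">Subject:</prosody><break time="400ms"/>
```

A markup layer of this kind keeps the prosodic annotation separate from the text itself, so the same utterance can be rendered with different prosodic styles.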
Synthesized speech usually lacks the prosodic features that make human speech sound lively. In everyday communication we all make considerable use of prosodic elements such as pitch and volume for emphasis. Prosody also conveys information that cannot be obtained in any other way: for example, shifting the emphasis from one word to another can change the meaning of a sentence dramatically. Furthermore, if complex elements such as lists, tables and addresses are produced using speech synthesis, the use of prosody is essential. Such verbal representations can be very hard to understand, even in human-to-human communication.

The most important prosodic features found in human speech are pitch, volume, rate and pauses. Current speech synthesizers allow reasonable control of these parameters, so these prosodic features could be utilized in speech user interfaces [1]. In our case, the motivation was a speech interface (in Finnish) to an e-mail client that we have been developing.

In order to find new ways to use prosody, we arranged an experiment in which human speakers recorded a set of utterances. The same sentences were also produced using a speech synthesizer. A group of listeners heard these utterances and answered questions about them. We found that some prosodic features seemed to increase the intelligibility of speech, while others made speech more pleasant to listen to. We analyzed the exact ways in which the human readers used their voices. We believe that by bringing these methods to synthetic speech we could increase both its intelligibility and pleasantness.

In the rest of this paper we will first propose how prosody could be supported in speech applications. We then describe the experiment and its results. Finally, conclusions from the experiment are drawn and ideas for future work are presented.

2. SUPPORTING PROSODY IN SPOKEN LANGUAGE APPLICATIONS

Previous research on speech user interfaces has focused mainly on prompt design [6], dialogue management issues such as navigation in lists and menus [4], and dialogue management strategies [7]. In general, most of the previous research on speech output has studied what information should be expressed in system utterances. We wanted to examine how system utterances could be expressed more effectively by adding prosodic features.

In our e-mail application, three kinds of utterances are spoken to the user: system utterances that are part of dialogue management, descriptions of sets of messages (“views”), and the messages themselves. The first case is the easiest one, because we know in advance what these utterances are. In the second case things get more complicated, since we have to deal with information that is not known in advance; however, the structure of the information is still fixed. The third case is the most challenging, since the information is totally unconstrained.

To improve the quality of speech output with prosodic features, we could add control codes to messages whose content or structure is known in advance. In this way messages could be fine-tuned by hand. However, this approach is not possible in all cases. Furthermore, in order to be natural and efficient the style of speech should not be static. Instead it