Example-based Training of Dialogue Planning Incorporating User and Situation Models

Shinichi Ueno, Ian R. Lane, Tatsuya Kawahara
School of Informatics, Kyoto University
Yoshida-Hommachi, Sakyo-ku, Kyoto 606-8501, Japan
{ian,kawahara}@ar.media.kyoto-u.ac.jp

Abstract

To provide a high level of usability, spoken dialogue systems must generate cooperative responses for a wide variety of users and situations. We introduce a dialogue planning scheme incorporating user and situation models that makes such dialogue adaptation possible. Manually developing a set of dialogue rules to account for all possible model combinations would be very difficult and would obstruct system portability. To overcome this problem, we propose a novel example-based training scheme for dialogue planning, in which example dialogues are collected through a role-playing simulation and a machine learning approach is used to train the dialogue planner. The proposed scheme is evaluated on the Kyoto city voice portal, a multi-domain spoken dialogue system. Subjects participated in a role-playing simulation in which they selected appropriate system responses at each dialogue turn based on a given scenario. Experimental results show that the scheme successfully trains the dialogue planner and provides reasonable system performance.

1. Introduction

The continual improvement of speech recognition and mobile communication technologies has enabled the development of interactive voice response (IVR) systems that allow users to obtain a variety of information via mobile-phone-based voice interfaces. However, such systems are typically difficult for non-experts to operate, and they do not provide cooperative dialogue. Whether a system is cooperative to a user depends on user characteristics, such as whether the user is a novice or in a hurry, and on other external factors, including the time of day.
For a spoken dialogue system to interact cooperatively with a user, such information must be considered during dialogue planning and response generation.

Previous research includes several methods that adapt dialogue strategies based on various cues [1, 2, 3]. Factors used for adaptation include the user's knowledge level in the target domain [4] and skill level in using the system [5]. External information, such as the time of day and the user's location, was incorporated into a mobile navigation system in [6]. These studies, however, typically focus on only a single factor, and the modeling is generally task dependent. To generate truly cooperative responses, multiple factors must be considered simultaneously during dialogue planning.

In this paper, we present a comprehensive modeling scheme to generate user- and situation-adapted responses for spoken dialogue systems. As domain-independent user characteristics, skill level with the system, degree of hastiness, and dialogue goal clarity are used and detected in real time. External factors, including the time of day, the location of the place of interest, and external events that may affect the task, are also taken into account. These models provide non-linguistic information that enables detailed user- and situation-specific dialogue plans to be generated.

The main problem in implementing a dialogue management scheme incorporating the above models is plan complexity. Manually generating an optimal set of dialogue rules to account for all possible model combinations would be very difficult, and there is no guarantee that such rules would generate optimal dialogue flows. To overcome this problem, we introduce a machine learning approach to dialogue planning. Training is performed for each user by collecting data from a role-playing dialogue, enabling a user-adaptive dialogue planning system to be realized.
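To make the modeling concrete, the user and situation models described above can be viewed as a single feature vector that the example-based planner consumes as one training (or prediction) instance. The following Python sketch illustrates this under our own assumptions: the class names, the discrete feature scales, and the flattening function are illustrative, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class UserModel:
    # Domain-independent user characteristics, detected in real time.
    # The 0-2 scales are assumed for illustration.
    skill_level: int    # 0 = novice .. 2 = experienced with the system
    hastiness: int      # 0 = relaxed .. 2 = in a hurry
    goal_clarity: int   # 0 = vague dialogue goal .. 2 = clear goal

@dataclass
class SituationModel:
    # External factors that may affect the task.
    hour_of_day: int        # 0-23
    near_destination: bool  # user location relative to the place of interest
    special_event: bool     # e.g. a festival affecting the task

def to_features(user: UserModel, situation: SituationModel) -> list:
    """Flatten both models into one feature vector for the planner."""
    return [user.skill_level, user.hastiness, user.goal_clarity,
            situation.hour_of_day,
            int(situation.near_destination), int(situation.special_event)]

# Hypothetical instance: a hurried novice with a clear goal, in the evening.
features = to_features(UserModel(0, 2, 2), SituationModel(19, True, False))
print(features)
```

Each role-playing turn would pair such a vector with the system response the subject selected, giving one labeled example for training the planner.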
2. Kyoto City Voice Portal System

To investigate the proposed planning approach, we have developed the Kyoto city voice portal system, a multi-domain spoken dialogue system that provides a spoken interface to three inter-related domains:

Tourist domain: Information on tourist spots within Kyoto city, including operating hours, entrance fees, and access methods, as well as information on festivals and special events that take place within the city.

Restaurant domain: Information on restaurants within Kyoto city. The system allows users to search for restaurants by food category, area, and budget.

Bus domain: Bus route and timetable information, including real-time bus location. The system enables users to determine the correct bus to take between a given location and destination, and also provides information on how close the approaching bus is to the specified bus stop.

The domains are inter-related, enabling users to search for restaurants near tourist spots, and providing bus access information for restaurants, tourist spots, and other landmarks.

2.1. System Architecture

An overview of the system is shown in Figure 1. VoiceXML scripts are generated dynamically by back-end dialogue agents based on the user's response and relevant dialogue state information. The TTS and ASR engines are driven by the given VoiceXML script.

The system contains two types of dialogue agents: a portal agent and multiple domain agents (tourist, restaurant, and bus). The portal agent controls the overall dialogue flow and regulates switching between domains, selecting the appropriate domain agent for each user query. The portal agent also enables infor-

INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, ICC Jeju, Jeju Island, Korea, October 4-8, 2004. ISCA Archive: http://www.isca-speech.org/archive. DOI: 10.21437/Interspeech.2004-731
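The portal-agent architecture described in Section 2.1 can be sketched as a simple dispatcher: the portal agent inspects each user query and routes it to one of the domain agents, which produces the response. The Python sketch below is a minimal illustration under our own assumptions; the keyword-based routing and the agent interfaces are hypothetical, and the real system emits VoiceXML scripts from much richer dialogue state rather than plain strings.

```python
class DomainAgent:
    """Base class for the tourist, restaurant, and bus domain agents."""
    keywords: set = set()

    def respond(self, query: str) -> str:
        raise NotImplementedError

class BusAgent(DomainAgent):
    keywords = {"bus", "route", "timetable"}

    def respond(self, query: str) -> str:
        # The real agent would consult real-time bus location data
        # and generate a VoiceXML script; here we return plain text.
        return "bus: looking up routes"

class RestaurantAgent(DomainAgent):
    keywords = {"restaurant", "food", "budget"}

    def respond(self, query: str) -> str:
        return "restaurant: searching by category, area, and budget"

class PortalAgent:
    """Controls overall dialogue flow and switches between domains."""

    def __init__(self, agents):
        self.agents = agents

    def handle(self, query: str) -> str:
        words = set(query.lower().split())
        for agent in self.agents:
            if words & agent.keywords:  # naive keyword routing (assumed)
                return agent.respond(query)
        return "portal: please choose tourist, restaurant, or bus information"

portal = PortalAgent([BusAgent(), RestaurantAgent()])
print(portal.handle("which bus route should I take"))
```

Centralizing the routing decision in one portal agent is what makes the inter-domain behavior possible: the same dispatcher can hand a restaurant query to the restaurant agent and then a follow-up access question to the bus agent within a single session.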