International Symposium on Tonal Aspects of Languages: Emphasis on Tone Languages, pp. 225-228, Mar. 28-30, 2004, Beijing, China

Context Directed Speech Recognition in Dialogue Systems

Pengju YAN 1 and Fang ZHENG 2
Center for Speech Technology, State Key Laboratory of Intelligent Technology and Systems
Department of Computer Science & Technology, Tsinghua University, Beijing, 100084, China
{yan, fzheng}@cst.cs.tsinghua.edu.cn, http://cst.cs.tsinghua.edu.cn

1 The author is now with Panasonic Beijing Laboratory. yanpj@cmrd.panasonic.com.cn
2 The author is also with Beijing d-Ear Technologies Co., Ltd. fzheng@d-Ear.com

Abstract

Research and development on spoken dialogue systems (SDSs) are becoming increasingly important as demand grows constantly. However, recognition strategies adopted merely for the sake of laboratory demos reveal evident defects under real-world circumstances, where utterances are casual rather than declamatory in style. Both the shortage of domain-specific corpora and the availability of other empirical/heuristic knowledge call for new methods to improve recognition performance. Here we present a recognition framework in which dialogue contexts (DCs) are incorporated as a restrictive knowledge source. Firstly, the idea of a focus expected (FE) under certain dialogue states is introduced. Secondly, the adaptation of the lexicon and grammar rules is proposed. Finally, the generation of a recognition automaton under a specific FE is put forward. Experiments are carried out in the dialogue system EasyFlight, and the results show the effectiveness of the strategies.

1. Introduction

Somewhat loosely, the term spoken dialogue system (or dialogue system for short) can be defined as a system that automatically provides services to people through a speech I/O interface.
As that implies, a dialogue system normally consists of four functional components: a speech recognizer, a language parser, a dialogue manager, and a speech synthesizer. Unlike in-laboratory speech systems, the main goal of a dialogue system is to accomplish a real-world pragmatic task, e.g. to find the best route to a site or to book an air ticket. Understanding performance is therefore the issue that researchers care about most.

Measures have been taken to counter the spontaaneity/casualness of spoken utterances in dialogue systems. At the purely acoustic level, hybrid pronunciation modeling has been proposed to deal with rich pronunciation variants and notable co-articulation, and preliminary but encouraging progress has been made, as reported in [1]. At the purely linguistic level, a robust understanding scheme that models repeats, word disordering, fragments, ellipses, and ill-formed input is presented in [2], and its grammar coverage has been shown to be sufficiently large against all of those ungrammaticalities. However, the recognition strategy itself still plays the role of a bottleneck in the whole picture.

Generally speaking, four kinds of speech recognition strategies have appeared so far that can be used in dialogue systems. The first and simplest is isolated word recognition. Owing to its high recognition rate, it can be used in crucial situations where even the slightest errors cannot be tolerated, but it offers very low user-friendliness. The second is keyword spotting, whose main idea is to highlight the task-concerned words over the unconcerned ones by means of various weights [3]. One hybrid is the so-called sliding-window word spotting, where the search process can start anywhere in the speech [4]. The main disadvantage of these methods is that any other knowledge can only be adopted as a confidence measure, which yields a low recognition rate.
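To illustrate the weighting idea behind keyword spotting, the following is a minimal sketch, not the actual scoring used in [3] or [4]: each task-concerned keyword contributes an additive score bonus, so hypotheses containing domain keywords outrank filler-only alternatives with comparable acoustic scores. The keyword set, acoustic log-scores, and bonus value below are all hypothetical.

```python
# Toy sketch of weighted keyword spotting: a hypothesis is scored by
# summing its per-word acoustic log-scores, and each task-concerned
# keyword adds a fixed bonus so keyword-bearing hypotheses win over
# fillers. All values here are illustrative, not from the paper.

KEYWORD_BONUS = 1.5  # hypothetical boost per task-concerned word

def score_hypothesis(words, acoustic_scores, keywords):
    """Sum acoustic log-scores, adding a bonus for each keyword hit."""
    total = 0.0
    for word, score in zip(words, acoustic_scores):
        total += score
        if word in keywords:
            total += KEYWORD_BONUS
    return total

# Task-concerned vocabulary for a flight-booking domain (illustrative).
keywords = {"ticket", "flight", "beijing", "shanghai"}

# Two competing hypotheses with per-word acoustic log-scores.
hyp_a = (["uh", "ticket", "to", "beijing"], [-2.0, -1.5, -0.5, -1.0])
hyp_b = (["uh", "take", "it", "to", "begin"], [-2.0, -1.2, -0.8, -0.5, -1.4])

best = max([hyp_a, hyp_b],
           key=lambda h: score_hypothesis(h[0], h[1], keywords))
print(" ".join(best[0]))  # the keyword-bearing hypothesis wins
```

Without the bonus, hyp_b would score higher (-5.9 vs. -5.0); the keyword weighting reverses the ranking, which is exactly the intended highlighting effect, but also shows why the approach degrades when non-keyword context carries the discriminative information.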
The third is template-based matching, where the input utterances are explicitly restricted to a search graph [4]. Even if the network is expanded at its arcs by alternating words within the same semantic class, the performance remains rather low on unpredicted utterances. The fourth is stochastic n-gram based recognition. Considering the shortage of sufficient training corpora, unified language models integrating n-grams and grammar rules have been put forward [5]. Although the perplexity drops considerably, unfortunately, the word error rate stays almost unchanged.

In this paper, a context directed recognition strategy is proposed, in which the dialogue context knowledge and semantic knowledge neglected by the previous methods are exploited to the fullest to guide the search process. The main idea is to predict the information the next turn will involve and then restrict the search process to a predetermined word network. The strategy can be outlined as follows. First, a focus expected is introduced in each dialogue turn to reflect the current inner dialogue state, which is a function of the history/context. Next, given a specific FE, a rule set is dynamically chosen according to the offline semantic labels. Once the rule set related to an FE satisfies certain conditions, it can be converted into a finite state network (FSN) whose arcs are associated with words. Finally, the recognizer produces the ultimate results by searching through the given FSN. This strategy is tried on an air travel information