Detection of Fillers Using Prosodic Features in Spontaneous Speech Recognition of Japanese

Keikichi Hirose 1, Yu Abe 2 & Nobuaki Minematsu 2

1 Dept. of Inf. and Commu. Engg, School of Inf. Science and Tech.
2 Dept. of Frontier Informatics, School of Frontier Sciences
University of Tokyo, Tokyo, Japan
{hirose, yu-abe, mine}@gavo.t.u-tokyo.ac.jp

Abstract

A new scheme for detecting fillers during spontaneous speech recognition was developed. When a filler hypothesis appears during the 2nd pass decoding of a speech recognizer with a two-pass configuration, a prosodic module checks the morpheme hypothesized as a filler and outputs the likelihood score of the morpheme being a filler. When the likelihood score exceeds a threshold, a prosodic score is added to the language score of the hypothesis as a bonus. The prosodic module is constructed using a five-layered perceptron. With prosodic features of the current, preceding, and following morphemes as inputs, the perceptron calculates the filler likelihood. A comparative recognition experiment with and without the prosodic module was conducted for 100 utterances of spontaneous speech taken from the academic meeting presentations in the Corpus of Spontaneous Japanese. Seven fillers originally misrecognized as non-fillers were correctly recognized as fillers when the prosodic module was used. No fillers originally recognized as fillers were wrongly recognized as non-fillers. Although a few non-filler morphemes were misrecognized as other non-filler morphemes after the prosodic module was introduced, these errors can be corrected by properly setting the parameters of the 2nd pass search process. These results indicate that the proposed scheme can improve the performance of spontaneous speech recognition.

1. Introduction

In view of the importance of prosodic features in human speech perception, a rather large number of research works have already been devoted to developing modules for prosodic event detection and to incorporating them into the speech recognition process. The authors have been developing several methods for continuous speech recognition along this line, and have realized certain improvements in recognition rates [1-4]. However, most of these works, including ours, addressed the recognition of text-reading style speech. In such cases, a large amount of data is usually obtainable for training acoustic and language models, and high recognition performance is obtainable without relying on prosodic features. The effect of the prosodic modules on the total recognition process therefore becomes unclear.

The situation may be different when it becomes difficult to obtain enough data, as in the case of spontaneous speech. Spontaneous speech may include a number of irregularities, such as hesitations (fillers/pauses), re-statements, and so on, which may largely degrade speech recognition performance. Since these parts show prosodic features different from those of other parts (of normal utterance) [5], they may be detectable by viewing fundamental frequency (F0) contours, power/amplitude contours, and segmental duration patterns, and this information may contribute to the final recognition results.

The most naïve way of using filler information for speech recognition is to detect filler portions independently and exclude those portions from the recognition process. However, this may not work well, because filler detection with prosodic features may include a certain number of errors even with sophisticated schemes.
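The alternative to skipping detected fillers outright, summarized in the abstract as adding a bonus to the language score when the filler likelihood exceeds a threshold, can be sketched roughly as follows. This is an illustration only and not the authors' implementation: the names `Hypothesis` and `add_filler_bonus`, the log-domain score layout, and the `threshold` and `bonus` values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for one 2nd pass hypothesis (log-domain scores)."""
    morphemes: List[str]
    acoustic_score: float
    language_score: float

    @property
    def total(self) -> float:
        # Total score assumed to be the plain sum of the two scores.
        return self.acoustic_score + self.language_score


def add_filler_bonus(hyp: Hypothesis,
                     filler_likelihood: float,
                     threshold: float = 0.5,
                     bonus: float = 2.0) -> Hypothesis:
    """Add a bonus to the language score when the filler likelihood
    exceeds the threshold; otherwise return the hypothesis unchanged.

    The threshold and bonus defaults are illustrative, not taken from the paper.
    """
    if filler_likelihood > threshold:
        return Hypothesis(hyp.morphemes,
                          hyp.acoustic_score,
                          hyp.language_score + bonus)
    return hyp


# A filler hypothesis with a high prosodic likelihood receives the bonus;
# one with a low likelihood keeps its original score and is never discarded.
hyp = Hypothesis(["eto", "kyou"], acoustic_score=-120.0, language_score=-15.0)
boosted = add_filler_bonus(hyp, filler_likelihood=0.9)
unchanged = add_filler_bonus(hyp, filler_likelihood=0.3)
```

Because a low likelihood leaves the hypothesis untouched rather than removing it, an error in the prosodic decision only forgoes the bonus instead of deleting speech from the search.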
From this point of view, we have developed a new method of using filler information for continuous speech recognition: the likelihood of fillers appearing in the decoding process of speech recognition is calculated using prosodic features, and, if the likelihood is high, the score of the hypothesis containing the fillers is increased. A neural network was adopted for the likelihood calculation, though other options were also possible.

The rest of the paper is organized as follows. The outline of the proposed method is explained in Section 2. After a short explanation of the speech material in Section 3, the neural network for the calculation of filler likelihood (prosodic module) is explained with experimental results in Section 4. Results of speech recognition experiments are shown in Section 5. Section 6 concludes the paper.

Speech Prosody 2006, Dresden, Germany, May 2-5, 2006. ISCA Archive: http://www.isca-speech.org/archive

2. Configuration of the Method

Figure 1: Total configuration of the proposed method.

Figure 1 shows the total configuration of the proposed method. As the speech recognition engine, Julius, developed as open-source software for continuous speech recognition, is used. The engine first conducts a quick coarse search (1st pass search) and then conducts a detailed backward search (2nd pass search) [6]. The 1st pass is the frame synchronous beam search with