Dynamic Language Model Adaptation Using Presentation Slides for Lecture Speech Recognition

Hiroki Yamazaki, Koji Iwano, Koichi Shinoda, Sadaoki Furui and Haruo Yokota
Department of Computer Science, Tokyo Institute of Technology, Japan
yamazaki@ks.cs.titech.ac.jp, {iwano, shinoda, furui, yokota}@cs.titech.ac.jp

Abstract

We propose a dynamic language model adaptation method that uses the temporal information of lecture slides for lecture speech recognition. The proposed method consists of two steps. First, the language model is adapted with the text information extracted from all the slides of a given lecture. Next, the text information of each slide is extracted based on temporal information and used for local adaptation. Hence, the language model used to recognize the speech associated with a given slide changes dynamically from one slide to the next. We evaluated the proposed method on speech data from four Japanese lecture courses. Our experiments show the effectiveness of the proposed method, especially for keyword detection: the F-measure error rate for lecture keywords was reduced by 2.4%.

Index Terms: language model adaptation, speech recognition, classroom lecture speech.

1. Introduction

Recent advances in computing and storage technology enable the archiving of large multimedia databases. Databases of classroom lectures at universities and colleges are particularly useful knowledge resources and are expected to be used in educational systems.

Recently, much effort has been devoted to constructing educational systems that use the multimedia content of classroom lectures to support distance learning [1, 2, 3, 4, 5]. Among the various kinds of content related to lectures, the transcription of speech data is expected to be the most important for indexing and searching lecture content [2, 6]. Therefore, a high-accuracy speech recognition engine for lectures is required. Lecture speech recognition has been studied extensively.
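The keyword-detection F-measure used in the evaluation above is the harmonic mean of precision and recall over a keyword list. A minimal sketch of how such a score could be computed (the keyword lists below are hypothetical examples, not data from the paper):

```python
def f_measure(reference, hypothesis):
    """Harmonic mean of precision and recall over keyword sets."""
    ref, hyp = set(reference), set(hypothesis)
    hits = len(ref & hyp)
    if hits == 0:
        return 0.0
    precision = hits / len(hyp)  # fraction of detected keywords that are correct
    recall = hits / len(ref)     # fraction of reference keywords that were detected
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 3 of 4 reference keywords detected, plus 1 false alarm.
reference = ["entropy", "perplexity", "trigram", "smoothing"]
hypothesis = ["entropy", "perplexity", "trigram", "backoff"]
score = f_measure(reference, hypothesis)  # precision 0.75, recall 0.75 -> F = 0.75
```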
Many research projects on lecture transcription, such as the European project CHIL (Computers in the Human Interaction Loop) [8] and the American iCampus Spoken Lecture Processing project [9], have been conducted. Trancoso et al. [7] investigated the automatic transcription of classroom lectures in Portuguese. Large databases of conference presentations, such as the Corpus of Spontaneous Japanese (CSJ) [10, 11] and the TED corpus [12], have been collected to improve speech recognition accuracy. Using these databases, state-of-the-art speech recognition systems for conference presentations achieve an accuracy of 70-80%. Hence, the recognition results provided by these systems are good enough to be used for speech summarization and speech indexing [13].

The speaking style of classroom lectures is, however, quite different from that of lectures at meetings or conferences. Classroom lectures are not always rehearsed in advance, and the same phrases are repeated many times for emphasis. The classroom lecture speaking style is closer to that of dialogue because lecturers are always ready to be interrupted by questions from students. The spontaneity of this kind of speech is much higher than that of other kinds of presentations; the lectures are characterized by strong coarticulation effects, non-grammatical constructions, hesitations, repetitions, and filled pauses. For these reasons, speech recognition of classroom lecture speech is generally more difficult than that of speech at conferences or meetings; its recognition accuracy is around 40-60%. Furthermore, no large database of classroom lecture speech is available for training acoustic and language models.

In classrooms, lecturers often use various materials, e.g., textbooks or slides, to help their students understand. Since these materials include many keywords that also appear in the lecture speech, they are expected to be useful for language modeling in speech recognition.
Several adaptation methods for language models using such content have already been proposed for lecture speech recognition. For example, Togashi et al. [14] proposed a method that uses the text information in presentation slides. When lecture speech is accompanied by slides, a strong correlation can be observed between the slides and the speech. In particular, the speech corresponding to a given slide contains most of the text information presented on that slide. We expect that this relation between the speech and the text information of the slides can improve language model adaptation for lecture speech recognition. We propose a dynamic adaptation method for language modeling that uses text information from slides. In this method, a slide-dependent language model is constructed for each slide and is then used to recognize the speech associated with that slide. The language model is thus changed dynamically as the lecture progresses.

This paper is organized as follows. In Section 2, the base system used in our studies is introduced. In Section 3, the proposed language model adaptation method is explained, and in Section 4, the effectiveness of the proposed method is discussed.

2. UPRISE: Unified Presentation Contents Retrieval by Impression Search Engine

UPRISE (Unified Presentation Contents Retrieval by Impression Search Engine) [1, 15] is a lecture presentation system that supports distance learning. It stores many types of multimedia material, such as text, pictures, graphs, images, sounds, voices, and videos, and provides a unified presentation view (Figure 1) as a lecture video retrieval system. The retrieval system returns lecture video scenes that match given keywords. Since the speech information in lectures is used to narrow down the search candidates [6], a high level of speech recognition accuracy is strongly required.

INTERSPEECH 2007, August 27-31, Antwerp, Belgium
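The dynamic part of the proposed method amounts to selecting, for each speech segment, the language model of the slide being displayed at that time. The sketch below illustrates this selection step only; the timestamp interface, model names, and fallback behavior are illustrative assumptions, not the authors' implementation:

```python
import bisect

class DynamicSlideLM:
    """Select the slide-dependent language model for each speech segment,
    based on the time at which each slide was first displayed (sketch only)."""

    def __init__(self, slide_times, slide_models, background_lm):
        # slide_times: display start time (seconds) of each slide, ascending.
        # background_lm: the model adapted with the text of all slides,
        # used here as a fallback for speech before the first slide.
        self.slide_times = slide_times
        self.slide_models = slide_models
        self.background_lm = background_lm

    def model_at(self, t):
        """Return the model for the slide shown at time t."""
        i = bisect.bisect_right(self.slide_times, t) - 1
        if i < 0:
            return self.background_lm
        return self.slide_models[i]

# Hypothetical usage: three slides shown at 0, 60, and 150 seconds.
lm = DynamicSlideLM([0, 60, 150],
                    ["lm_slide1", "lm_slide2", "lm_slide3"],
                    "lm_global")
model = lm.model_at(75.0)  # -> "lm_slide2"
```

In practice each slide-dependent model would itself be an interpolation of the globally adapted model with counts from that slide's text, so the per-slide switch only reweights, rather than replaces, the lecture-wide vocabulary.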