On the use of Web resources and natural language processing techniques to improve automatic speech recognition systems

Gwénolé Lecorvé, Guillaume Gravier, Pascale Sébillot
IRISA, Campus de Beaulieu, 35042 Rennes, France
{gwenole.lecorve,guillaume.gravier,pascale.sebillot}@irisa.fr

Abstract

Language models used in current automatic speech recognition systems are trained on general-purpose corpora and are therefore poorly suited to transcribing spoken documents that deal with a succession of precise topics, such as long multimedia streams frequently featuring reports and debates. To overcome this problem, this paper shows that Web resources and natural language processing techniques can be used to automatically collect a topic-specific corpus from the Internet in order to adapt the baseline language model of an automatic speech recognition system. We detail how to characterize the topic of a segment and how to collect Web pages from which a topic-specific language model can be trained. We finally present experiments where an adapted language model, obtained by combining the topic-specific language model with the general-purpose one, is used to produce new transcriptions. The results show that our topic adaptation technique leads to significant transcription quality gains.

1. Introduction

Using speech transcriptions is an effective way to index long multimedia streams, such as 24 hours of TV or radio broadcasts. To generate these transcriptions, current automatic speech recognition (ASR) systems rely on language models (LMs) that gather word sequence probabilities, typically n-gram probabilities, and help the system select the utterance hypotheses with the highest likelihood. In practice, these n-gram probabilities are estimated globally, once and for all, on large multi-topic corpora.
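As a toy illustration of how such n-gram probabilities are estimated from a training corpus, the following sketch computes maximum-likelihood bigram probabilities. All data and function names here are hypothetical; real ASR language models use higher orders, large vocabularies, and smoothing.

```python
from collections import Counter

def bigram_probs(corpus):
    """Estimate maximum-likelihood bigram probabilities P(w2 | w1)
    from a list of tokenized sentences (toy illustration only)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])               # count each bigram history
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Hypothetical two-sentence corpus
corpus = [["the", "news", "report"], ["the", "debate"]]
probs = bigram_probs(corpus)
# "the" occurs twice and is followed by "news" once, so P(news | the) = 0.5
```

Without smoothing, any bigram unseen in training receives probability zero, which is precisely why a model estimated on a multi-topic corpus handles topic-specific word sequences poorly.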
However, since n-gram probabilities vary with the topic, such multi-topic LMs are not accurate for transcribing spoken documents that successively tackle various topics, like broadcast news or debates. To circumvent this problem, this paper proposes to use natural language processing techniques to automatically adapt a general-purpose LM to any topic, using the Internet as an open resource from which an adaptation corpus is dynamically gathered. The final goal of this paper is to improve the transcription quality of ASR systems on topic-specific segments. To this end, experiments are carried out on a large set of varied radio broadcast news shows.

The paper is organized as follows: Section 2 presents an overview of the complete LM adaptation technique. Work related to the different steps of the adaptation process is presented in Section 3. Sections 4 to 6 detail each key point of the proposed approach, and experimental results are given in Section 7.

2. Overall approach

As presented in Fig. 1, the basic idea of our topic adaptation technique is to use the Internet as an open linguistic resource from which topic-specific texts can be retrieved to estimate new n-gram probabilities for almost any topic.

Figure 1: Overview of our Web-based language model adaptation process.

The process described below is applied to single-topic segments, previously transcribed with a baseline general-purpose LM, where such segments may come either from the thematic segmentation of a long speech stream or from smaller documents such as broadcast news shows, as considered in this paper, or podcasts. For each segment, keywords are extracted using information retrieval techniques in order to characterize the topic. These terms are used to form queries that are submitted to a Web search engine (Yahoo!). Retrieved pages are then browsed and filtered to build a corpus from which topic-specific LM probabilities are estimated before being combined with the probabilities of the general-purpose LM.
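The paper does not specify at this point how the two sets of probabilities are combined; linear interpolation is a standard choice for this kind of LM adaptation and is sketched below. The weight and the probability values are illustrative assumptions, not figures from the paper.

```python
def interpolate(p_topic, p_general, lam=0.7):
    """Combine topic-specific and general-purpose n-gram probabilities
    by linear interpolation:
        P_adapted(w | h) = lam * P_topic(w | h) + (1 - lam) * P_general(w | h)
    The weight `lam` is hypothetical here; in practice it would be tuned,
    e.g. to minimize perplexity on held-out development data."""
    ngrams = set(p_topic) | set(p_general)
    return {ng: lam * p_topic.get(ng, 0.0) + (1.0 - lam) * p_general.get(ng, 0.0)
            for ng in ngrams}

# Hypothetical probability tables for one bigram
p_general = {("the", "debate"): 0.10}   # baseline multi-topic LM
p_topic   = {("the", "debate"): 0.40}   # LM trained on retrieved Web pages
adapted = interpolate(p_topic, p_general)
```

Interpolating rather than replacing the baseline LM keeps coverage of general language while boosting the probability of topic-specific word sequences.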
Finally, new, and hopefully better, transcriptions are obtained from the ASR system using the adapted LM. This approach raises several questions at each step of the process. First, how to extract keywords which, on the one