STRATEGIES FOR LANGUAGE MODEL WEB-DATA COLLECTION
Vincent Wan, Thomas Hain
Department of Computer Science
University of Sheffield, UK
{v.wan,t.hain}@dcs.shef.ac.uk
ABSTRACT
This paper presents an analysis of the use of textual information
collected from the internet via a search engine for the purpose
of building domain specific language models. A framework to
analyse the effect of search query formulation on the resulting
web-data language model performance in an evaluation is devel-
oped. The framework gives rise to improved methods of selecting
n-gram search engine queries, which return documents that make
better domain specific language models.
1. INTRODUCTION
The construction of a competitive automatic speech recognition
(ASR) system requires considerable amounts of data for both
acoustic and language modelling. It is a well known disadvan-
tage of such systems that, for optimal performance, the data has
to originate from the specific task and domain. For acoustic
modelling, the recording of sufficient data can be costly and time
consuming. In the case of language modelling, the collection of
sufficient data is an overwhelming task, especially for tasks
covering inter-human interaction. The use of background language
models trained on large amounts of spoken and written text helps
but the overall system performance is still considerably poorer
without in-domain data.
Recently it was shown that data collected from the world-
wide-web via a search engine could aid in the collection of in-
domain data [1, 2, 3, 4]. Search engine queries are formed from
n-grams obtained from a small sample of in-domain data. The
text retrieved from these queries is then normalised, filtered and
used to train a standard n-gram language model. While that
model could be used directly, it was found to be beneficial to
interpolate it with a generic background model. This approach is
particularly appealing both for tasks concerning conversational
speech and for speech from highly specialised areas, because
the world-wide-web holds many transcripts of speech as well as
a wealth of specialised material. The technique was used widely
for the transcription of conversational telephone speech in recent
U.S. NIST evaluations [5] and for meeting room transcriptions [6].
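The query-formulation step described above can be sketched as follows; this is an illustrative Python fragment (the helper name, n-gram order and number of queries are our choices, not values from the paper), not the authors' implementation:

```python
from collections import Counter

def ngram_queries(seed_text, n=3, max_queries=5):
    """Form search-engine queries from the most frequent n-grams of a
    small in-domain seed text. Each n-gram is quoted so the search
    engine matches it as an exact phrase."""
    words = seed_text.split()
    ngrams = Counter(tuple(words[i:i + n])
                     for i in range(len(words) - n + 1))
    return ['"' + " ".join(g) + '"'
            for g, _ in ngrams.most_common(max_queries)]

queries = ngram_queries(
    "we could meet on tuesday or we could meet on friday", n=3)
```

The text returned for each query would then be normalised and filtered before language model training.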
Experimental evidence suggests that the selection of the search
queries has a considerable impact on the performance of the re-
sulting language models, both in terms of perplexity and word
error rates. This fact was also noted in recent work by Sethy et
al. [7], who proposed multiple changes to the original technique:
firstly, the search for query terms is based on the relative entropy
between an in-domain topic model and a background model; and
secondly, both the topic and the background language models are
updated according to relevance estimates based on log-probabilities
of the prior language models, after which data selection is performed
on an utterance level.

(This work was partly supported by the European Union 6th FWP IST
Integrated Project AMI, Augmented Multi-party Interaction, FP6-506811,
publication AMI-145.)
This paper develops a framework to analyse the effect of
search query formulation on the resulting performance in an eval-
uation. In contrast to the work in [7], the formulation operates
on a “per n-gram” basis. In order to retain robustness with in-
domain data we derive simple measures for the selection of query
terms.
The rest of the paper is organised as follows: section 2 de-
scribes web-data collection mathematically and motivates the use
of search models in section 3. Section 4 provides an analysis and
supporting experimental results. Section 5 concludes the paper.
2. COLLECTING WEB-DATA
Let B denote the background text, for example, a corpus of generic
conversational speech that is topic independent. Let T be a small
corpus that indicates the topic of interest and serves as the seed
for the collection of a larger corpus C from the internet. Let E be
the evaluation corpus, which may be identical to T but in reality
should be different.
Assume that the language models are unsmoothed n-grams
of arbitrary history depth, so the probability of an n-gram given
a model derived from B is denoted
P(w|h, B) = N(w, h, B) / N(h, B)    (1)
where h is the history of word w, N (w, h, B) is the count of the
n-gram (w, h) in corpus B and N (h, B) is similarly defined as
the count of h in B.
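Equation (1) can be computed directly from token counts. The following Python sketch (the function name and corpus representation are our choices) implements the unsmoothed maximum-likelihood estimate:

```python
from collections import Counter

def ngram_prob(corpus, w, h):
    """Unsmoothed n-gram probability P(w|h) as in equation (1):
    N(w, h) / N(h), where the corpus is a list of tokens and
    h is a tuple of history words."""
    n = len(h)
    # N(h): count of the history h wherever it is followed by a word.
    hist = Counter(tuple(corpus[i:i + n])
                   for i in range(len(corpus) - n))
    # N(w, h): count of the full n-gram (h, w).
    full = Counter(tuple(corpus[i:i + n + 1])
                   for i in range(len(corpus) - n))
    return full[h + (w,)] / hist[h] if hist[h] else 0.0
```

For example, in the token sequence `a b a b a c`, the history `a` occurs three times and is followed by `b` twice, giving P(b|a) = 2/3.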
The log likelihood of the corpus E given the model derived
from B is
log P(E|B) = Σ_w Σ_h N(w, h, E) log P(w|h, B)    (2)
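A minimal sketch of equation (2) for a fixed n-gram order, assuming both corpora are given as token lists (the function names are ours). Because the model is unsmoothed, any n-gram of E unseen in B has zero probability, so the log likelihood diverges to negative infinity:

```python
import math
from collections import Counter

def log_likelihood(eval_corpus, model_corpus, n=2):
    """log P(E|B) as in equation (2): sum over n-grams (w, h) of
    N(w, h, E) * log P(w|h, B), with an unsmoothed model from B."""
    def counts(toks, k):
        return Counter(tuple(toks[i:i + k])
                       for i in range(len(toks) - k + 1))
    full_B = counts(model_corpus, n)
    hist_B = counts(model_corpus, n - 1)
    ll = 0.0
    for gram, c in counts(eval_corpus, n).items():
        h = gram[:-1]
        p = full_B[gram] / hist_B[h] if hist_B[h] else 0.0
        if p == 0.0:
            # Unseen n-gram: the unsmoothed model assigns it zero mass.
            return float("-inf")
        ll += c * math.log(p)
    return ll
```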
When collecting web-data C, the aim is to ensure that the
language model BC derived from an interpolation of B and C is
more likely to generate E than the model B alone.
log P (E|BC) > log P (E|B) (3)
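Criterion (3) can be checked directly once both log likelihoods are available. The sketch below also shows linear interpolation of the two models' probabilities, one common way of forming BC (the paper does not specify the interpolation scheme, and the weight here is illustrative):

```python
def accept_web_data(ll_E_given_BC, ll_E_given_B):
    """Criterion (3): keep the web corpus C only if the interpolated
    model BC makes the evaluation corpus E more likely than B alone."""
    return ll_E_given_BC > ll_E_given_B

def interpolate(p_B, p_C, lam):
    """Linear interpolation of n-gram probabilities from B and C;
    lam is the weight on the web-data model (an illustrative choice)."""
    return lam * p_C + (1.0 - lam) * p_B
```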
1-4244-0469-X/06/$20.00 ©2006 IEEE. ICASSP 2006.