STRATEGIES FOR LANGUAGE MODEL WEB-DATA COLLECTION
Vincent Wan, Thomas Hain
Department of Computer Science
University of Sheffield, UK
{v.wan,t.hain}@dcs.shef.ac.uk
ABSTRACT
This paper presents an analysis of the use of textual information
collected from the internet via a search engine for the purpose
of building domain specific language models. A framework to
analyse the effect of search query formulation on the resulting
web-data language model performance in an evaluation is devel-
oped. The framework gives rise to improved methods of selecting
n-gram search engine queries, which return documents that make
better domain specific language models.
1. INTRODUCTION
The construction of a competitive automatic speech recognition
(ASR) system requires considerable amounts of data for both
acoustic and language modelling. It is a well known disadvan-
tage of such systems that, for optimal performance, the data has
to originate from the specific task and domain. For acoustic
modelling, the recording of sufficient data can be costly and time
consuming. In the case of language modelling, the collection of
sufficient data is an overwhelming task, especially for tasks
covering inter-human interaction. The use of background language
models trained on large amounts of spoken and written text helps
but the overall system performance is still considerably poorer
without in-domain data.
Recently it was shown that data collected from the world-
wide-web via a search engine could aid in the collection of in-
domain data [1, 2, 3, 4]. Search engine queries are formed from
n-grams obtained from a small sample of in-domain data. The
text retrieved from these queries is then normalised, filtered and
used to train a standard n-gram language model. While that
model could be used directly, it was found to be beneficial to
interpolate it with a generic background model. This approach is
particularly appealing both for tasks concerning conversational
speech and for speech from highly specialised areas, because
the world-wide-web holds many transcripts of speech as well as
a wealth of specialised material. The technique was used widely
for the transcription of conversational telephone speech in recent
U.S. NIST evaluations [5] and for meeting room transcriptions [6].
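The query-formulation step described above can be sketched as follows; this is an illustrative Python fragment (the helper name, n-gram order and number of queries are our choices, not values from the paper), not the authors' implementation:

```python
from collections import Counter

def ngram_queries(seed_text, n=3, max_queries=5):
    """Form search-engine queries from the most frequent n-grams of a
    small in-domain seed text. Each n-gram is quoted so the search
    engine matches it as an exact phrase."""
    words = seed_text.split()
    ngrams = Counter(tuple(words[i:i + n])
                     for i in range(len(words) - n + 1))
    return ['"' + " ".join(g) + '"'
            for g, _ in ngrams.most_common(max_queries)]

queries = ngram_queries(
    "we could meet on tuesday or we could meet on friday", n=3)
```

The text returned for each query would then be normalised and filtered before language model training.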
Experimental evidence suggests that the selection of the search
queries has a considerable impact on the performance of the re-
sulting language models, both in terms of perplexity and word
error rates. This fact was also noted in recent work by Sethy et
al. [7], who proposed multiple changes to the original technique:
firstly, the search for query terms is based on the relative entropy
between an in-domain topic model and a background model; and
secondly, both the topic and the background language models are
updated according to relevance estimates based on log-probabilities
of the prior language models, after which data selection is performed
on an utterance level.

(This work was partly supported by the European Union 6th FWP IST
Integrated Project AMI, Augmented Multi-party Interaction, FP6-506811,
publication AMI-145.)
This paper develops a framework to analyse the effect of
search query formulation on the resulting performance in an eval-
uation. In contrast to the work in [7], the formulation operates
on a “per n-gram” basis. In order to retain robustness with in-
domain data we derive simple measures for the selection of query
terms.
The rest of the paper is organised as follows: section 2 de-
scribes web-data collection mathematically and motivates the use
of search models in section 3. Section 4 provides an analysis and
supporting experimental results. Section 5 concludes the paper.
2. COLLECTING WEB-DATA
Let B denote the background text, for example, a corpus of generic
conversational speech that is topic independent. Let T be a small
corpus that indicates the topic of interest and serves as the seed
for the collection of a larger corpus C from the internet. Let E be
the evaluation corpus, which may be identical to T but in reality
should be different.
Assume that the language models are unsmoothed n-grams
of arbitrary history depth, so the probability of an n-gram given
a model derived from B is denoted
P(w|h, B) = N(w, h, B) / N(h, B)    (1)
where h is the history of word w, N (w, h, B) is the count of the
n-gram (w, h) in corpus B and N (h, B) is similarly defined as
the count of h in B.
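Equation (1) can be computed directly from token counts. The following Python sketch (the function name and corpus representation are our choices) implements the unsmoothed maximum-likelihood estimate:

```python
from collections import Counter

def ngram_prob(corpus, w, h):
    """Unsmoothed n-gram probability P(w|h) as in equation (1):
    N(w, h) / N(h), where the corpus is a list of tokens and
    h is a tuple of history words."""
    n = len(h)
    # N(h): count of the history h wherever it is followed by a word.
    hist = Counter(tuple(corpus[i:i + n])
                   for i in range(len(corpus) - n))
    # N(w, h): count of the full n-gram (h, w).
    full = Counter(tuple(corpus[i:i + n + 1])
                   for i in range(len(corpus) - n))
    return full[h + (w,)] / hist[h] if hist[h] else 0.0
```

For example, in the token sequence `a b a b a c`, the history `a` occurs three times and is followed by `b` twice, giving P(b|a) = 2/3.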
The log likelihood of the corpus E given the model derived
from B is
log P(E|B) = Σ_w Σ_h N(w, h, E) log P(w|h, B)    (2)
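A minimal sketch of equation (2) for a fixed n-gram order, assuming both corpora are given as token lists (the function names are ours). Because the model is unsmoothed, any n-gram of E unseen in B has zero probability, so the log likelihood diverges to negative infinity:

```python
import math
from collections import Counter

def log_likelihood(eval_corpus, model_corpus, n=2):
    """log P(E|B) as in equation (2): sum over n-grams (w, h) of
    N(w, h, E) * log P(w|h, B), with an unsmoothed model from B."""
    def counts(toks, k):
        return Counter(tuple(toks[i:i + k])
                       for i in range(len(toks) - k + 1))
    full_B = counts(model_corpus, n)
    hist_B = counts(model_corpus, n - 1)
    ll = 0.0
    for gram, c in counts(eval_corpus, n).items():
        h = gram[:-1]
        p = full_B[gram] / hist_B[h] if hist_B[h] else 0.0
        if p == 0.0:
            # Unseen n-gram: the unsmoothed model assigns it zero mass.
            return float("-inf")
        ll += c * math.log(p)
    return ll
```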
When collecting web-data C, the aim is to ensure that the
language model BC derived from an interpolation of B and C is
more likely to generate E than the model B alone.
log P (E|BC) > log P (E|B) (3)
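Criterion (3) can be checked directly once both log likelihoods are available. The sketch below also shows linear interpolation of the two models' probabilities, one common way of forming BC (the paper does not specify the interpolation scheme, and the weight here is illustrative):

```python
def accept_web_data(ll_E_given_BC, ll_E_given_B):
    """Criterion (3): keep the web corpus C only if the interpolated
    model BC makes the evaluation corpus E more likely than B alone."""
    return ll_E_given_BC > ll_E_given_B

def interpolate(p_B, p_C, lam):
    """Linear interpolation of n-gram probabilities from B and C;
    lam is the weight on the web-data model (an illustrative choice)."""
    return lam * p_C + (1.0 - lam) * p_B
```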
1-4244-0469-X/06/$20.00 ©2006 IEEE. ICASSP 2006.