Fitting German into N-Gram Language Models

Robert Hecht, Jürgen Riedler, and Gerhard Backfried

Speech, Artificial Intelligence, and Language Laboratories
Operngasse 20B, A-1040 Vienna, Austria
E-mail: {robert,juergen,gerhard}@sail-technology.com

Abstract. We report on a series of experiments addressing the fact that German is less suited than English to word-based n-gram language models. Several systems were trained at different vocabulary sizes using various sets of lexical units. They were evaluated against a newly created corpus of German and Austrian broadcast news.

1 Introduction

The performance of an ASR system is determined to a large extent by how well the language model and vocabulary fit the speech to be recognized. The recognition lexicon should contain most of the words likely to appear, i.e. it should achieve a minimal out-of-vocabulary (OOV) rate, since low lexical coverage leads to high word error rates (WER).

State-of-the-art ASR systems typically use full-form word lexica and language models (LMs). Languages like German, however, exhibit a large variety of distinct lexical forms. Morphological decomposition of orthographic words could therefore improve OOV rates, at the cost of potentially higher WER due to the neglect of coarticulatory effects in pronunciation and the shorter span of language-model contexts.

The main mechanisms for creating new words in German are inflection, derivation, and compounding. Inflections and derivations are typically formed with a limited number of short affixes. Compounding is a very productive process in which the tendency to convey information in compressed form leads to the accumulation of more and more components [1]. New words are formed spontaneously, and occasionalisms frequently emerge that will not have been seen in any of the texts used to train language models. Since German orthography requires the constituents of a compound to be written as a single word, the set of possible compounds is potentially unlimited.

Remedies for strongly inflecting and compounding languages roughly divide into two branches: morphology-based [2,3] and data-driven [4] compound-word splitting, both often accompanied by grammar models other than simple n-grams (e.g. class-based or long-distance models). In the broadcast-news domain, [2] focused on identifying and processing the statistically most relevant sources of lexical variety: a set of about 450 decomposition rules, derived using statistics from a large text corpus, together with partial inflection stripping, reduced OOV rates from 4.5 % to 2.9 %. The study described in [3] is based on the Verbmobil database, consisting of over 400 human-to-human dialogues. Decomposition along phonological lines reduced both vocabulary size and perplexity, which in turn led to a more robust language model. In a further step, suppressing the decomposition of very frequent compounds yielded the lowest WERs.
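To make the notion of lexical coverage concrete, the following minimal Python sketch computes the OOV rate of a fixed recognition vocabulary over a token stream. The vocabulary and sentence are invented toy data, not taken from our corpus; note how an unseen compound is OOV even though its constituents would be covered.

def oov_rate(vocab, tokens):
    """Fraction of running word tokens not covered by the vocabulary."""
    misses = sum(1 for t in tokens if t not in vocab)
    return misses / len(tokens) if tokens else 0.0

# Toy vocabulary and test sentence (illustrative only).
vocab = {"die", "regierung", "hat", "das", "gesetz", "beschlossen", "steuer"}
tokens = "die regierung hat das steuergesetz beschlossen".split()

# "steuergesetz" is OOV as a full form, although "steuer" and
# "gesetz" are both in the vocabulary -- the situation compound
# splitting is meant to address.
print(f"OOV rate: {oov_rate(vocab, tokens):.1%}")  # 1 of 6 tokens -> 16.7%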
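As a sketch of the data-driven branch, the snippet below splits a compound into known parts using a frequency-based heuristic: among candidate decompositions, prefer the one whose parts have the highest geometric-mean corpus count. This is a common approach in the spirit of [4], not necessarily its exact method, and the counts are toy values.

from math import prod

def best_split(word, freq, min_part=3):
    """Return the most plausible decomposition of `word`, or [word] itself."""
    best, best_score = [word], freq.get(word, 0)
    # Try each split point; recursing on the tail handles multi-part compounds.
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if freq.get(left, 0):
            parts = [left] + best_split(right, freq, min_part)
            counts = [freq.get(p, 0) for p in parts]
            if all(counts):
                # Geometric mean of the part frequencies.
                score = prod(counts) ** (1.0 / len(counts))
                if score > best_score:
                    best, best_score = parts, score
    return best

# Toy frequency table (illustrative counts, not corpus statistics).
freq = {"bundes": 500, "kanzler": 300, "wahl": 800, "bundeskanzlerwahl": 2}
print(best_split("bundeskanzlerwahl", freq))
# -> ['bundes', 'kanzler', 'wahl'], since the parts outscore the rare full form

A real system would add the refinements discussed above, e.g. keeping very frequent compounds intact, since [3] found that suppressing their decomposition yielded the lowest WERs.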
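Finally, a toy computation of the perplexity that decomposition is reported to reduce, here for a bigram model with add-one smoothing. The training and test strings are invented and serve only to illustrate the quantity, not our experimental setup.

from collections import Counter
from math import log, exp

def bigram_perplexity(train, test):
    """Perplexity of an add-one-smoothed bigram model on a test token list."""
    unigrams = Counter(train)
    bigrams = Counter(zip(train, train[1:]))
    V = len(unigrams)  # vocabulary size for smoothing
    logp = 0.0
    for prev, word in zip(test, test[1:]):
        # Add-one smoothed P(word | prev).
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
        logp += log(p)
    return exp(-logp / (len(test) - 1))

train = "der kanzler spricht und der kanzler antwortet".split()
test = "der kanzler spricht".split()
print(f"perplexity: {bigram_perplexity(train, test):.2f}")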