Very Large Vocabulary Speech Recognition System for Automatic Transcription of Czech Broadcast Programs

Jan Nouza, Dana Nejedlova, Jindrich Zdansky, Jan Kolorenc
SpeechLab, Department of Electronics and Signal Processing
Technical University of Liberec, Hálkova 6, 461 17 Liberec, Czech Republic
{jan.nouza, dana.nejedlova, jindrich.zdansky, jan.kolorenc}@vslib.cz

Abstract

This paper describes the first speech recognition system capable of transcribing a wide range of spoken broadcast programs in the Czech language with an OOV rate below 3 per cent. To achieve that level we had to a) create an optimized 200k-word vocabulary with multiple text and pronunciation forms, b) extract an appropriate language model from a 300M-word text corpus, and c) develop our own decoder, specially designed for a lexicon of that size. The system was tested on various types of broadcast programs with the following results: the Czech part of the European COST278 database of TV news (71.5 % accuracy on complete news streams, 82.7 % on their clean parts), radio news (80.2 %), read commentaries (78.6 %), broadcast debates (74.3 %) and recordings of the state presidents' speeches (85.8 %).

1. Introduction

Automatic transcription of spoken broadcast programs is one of the most challenging tasks in the speech processing domain. Several systems for this purpose have recently been developed for major languages. The results reported in [1-3] demonstrate that machine transcription of broadcast news in English, Spanish or Japanese already performs quite well, and relatively fast, with vocabularies limited to the 60 thousand most frequent words. However, some languages need much larger vocabularies to cover the breadth of their lexical inventories. One of them is German, in which the lexicon must comprise up to 300k words to reach the same text coverage as in English [4].

The transcription of spoken Czech is an even more complex task. Unlike German, where the lexicon grows mainly through compounding, the major difficulty of Czech (as well as Russian and other Slavic languages) lies in the many inflected forms that can be derived from a single lexical lemma (up to 14 for a noun, 100 for an adjective and even more for a verb). These forms reflect complex grammar rules governed by the principle of gender, number and case agreement between the interrelated parts of speech in a sentence [5]. A side effect of this rich inflection and strong grammatical agreement is the relatively free word order of Czech utterances, i.e. something that goes against the assumptions of the standard N-gram language modeling method (recalled below).

The first attempts to adapt existing speech recognition platforms (the AT&T decoder and the SRILM toolkit) to the Czech broadcast news transcription task date back to 2001 [6]. It was clear, however, that the vocabulary size limits posed by those tools were critical for the task and could not be efficiently overcome by alternative approaches such as morphological decomposition of words. In this paper we demonstrate that achieving significantly better results requires lexicons containing several hundred thousand words and word forms.
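As a reminder of the modeling background (standard textbook material, not a formula taken from this paper), an N-gram language model approximates the probability of a word sequence w_1, ..., w_T using only a short left context, with parameters typically estimated from corpus counts C(.):

P(w_1, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1}, \ldots, w_{t-1}), \qquad
P(w_t \mid h_t) = \frac{C(h_t, w_t)}{C(h_t)}, \quad h_t = w_{t-N+1}, \ldots, w_{t-1}.

With free word order, the same inflected forms occur in many permutations, so the history counts C(h_t, w_t) are spread over far more distinct contexts than in a fixed-order language; rich inflection additionally multiplies the number of distinct word types w_t that the lexicon must cover.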
We further show how to optimize the lexicon-building process and how to benefit from multiple text and pronunciation variants of lexical items, and we briefly describe the design of our own decoder, tailored to the very large vocabulary task. In the experimental part we present the results achieved in a wide range of applications. We do not focus only on the classical broadcast news task; we also test the system on other interesting jobs, such as the transcription of radio debates, daily commentaries and broadcast speeches given by politicians.

2. Text corpus

As research moves from domain-oriented tasks to unconstrained speech recognition with a virtually unlimited vocabulary, the importance of large text corpora increases dramatically. They are needed for compiling lexicons that optimally represent the given language, for estimating the parameters of probabilistic N-gram models, and for monitoring short- and long-term socio-linguistic evolution. For Czech (a language spoken by some 10 million people), the Czech National Corpus has been compiled at Charles University in Prague [7]. However, its primary orientation towards linguistic research and its limited size make it inappropriate for our purpose. Therefore, three years ago we started to build our own corpus of Czech texts.

2.1. 300M word corpus and its characteristics

The corpus is compiled from texts available in electronic form. Currently, its size is about 1.9 GB, and after cleaning it contains 290 million words. The majority of the texts (275M words) come from Czech newspapers published between 1990 and 2004, obtained either from commercial CDs or from the internet. The remaining part comes from other electronic sources, mainly novels and professional books available on the web. Unfortunately, only a 12 MB portion represents transcribed spoken language, mainly TV and radio news.

The analysis of the corpus revealed 1,910,641 distinct items in the texts. Of these, 788,251 passed the spell-checker built into the Czech version of the MS Word editor. All these passing words were general lexical items, not proper names. Proper names (i.e. items with an initial capital letter) formed a group of another 400K items. The remaining part of almost 700K items was not analyzed in detail; we assume that it consists mainly of typing errors, non-standard colloquial words and items coming from foreign languages. The analysis showed that we must expect
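To make the lexicon-size argument concrete, the following minimal Python sketch (our illustration under stated assumptions, not the authors' actual tooling; the file names and the 200k cut-off are examples) builds a lexicon from the k most frequent word forms of a whitespace-tokenized corpus and measures the OOV rate such a lexicon yields on held-out text:

  from collections import Counter

  def build_vocab(corpus_path, k=200_000):
      # Count word-form frequencies over a whitespace-tokenized corpus
      # and keep the k most frequent forms as the recognition lexicon.
      counts = Counter()
      with open(corpus_path, encoding="utf-8") as f:
          for line in f:
              counts.update(line.lower().split())
      return {w for w, _ in counts.most_common(k)}

  def oov_rate(vocab, heldout_path):
      # Fraction of held-out tokens that fall outside the lexicon.
      total = oov = 0
      with open(heldout_path, encoding="utf-8") as f:
          for line in f:
              for tok in line.lower().split():
                  total += 1
                  oov += tok not in vocab
      return oov / total if total else 0.0

  # Hypothetical usage: a 200k lexicon estimated on the text corpus,
  # evaluated on a held-out broadcast transcript.
  vocab = build_vocab("corpus.txt")
  print("OOV rate: %.2f %%" % (100.0 * oov_rate(vocab, "heldout.txt")))

Sweeping k in such a script traces the coverage curve of the language; the paper's point is that for Czech this curve drops below a 3 per cent OOV rate only at lexicon sizes around 200k word forms, far beyond the 60k that suffices for English broadcast news.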