Language model expansion using webdata for spoken document retrieval

Ryo Masumura, Seongjun Hahm, Akinori Ito
Graduate School of Engineering, Tohoku University
{ryo77373,branden65,aito}@spcom.ecei.tohoku.ac.jp

Abstract

In recent years, there has been increasing demand for ad hoc retrieval of spoken documents. We can use existing text retrieval methods by transcribing spoken documents into text data using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, retrieval performance is severely deteriorated by recognition errors and out-of-vocabulary (OOV) words. To solve these problems, we previously proposed an expansion method that compensates for the transcription by using text data downloaded from the Web. In this paper, we introduce two improvements to the existing document expansion framework. First, we use a large-scale sample database of webdata as the source of relevant documents, thus avoiding the bias introduced by choosing keywords in the existing methods. Next, we use a document retrieval method based on a statistical language model (SLM), which is a popular framework in information retrieval, and also propose a new smoothing method that takes recognition errors and missing keywords into account. Retrieval experiments show that the proposed methods yield good results.

Index Terms: Spoken document retrieval, statistical language models, World Wide Web

1. Introduction

With the development of information and communication technology, we can now access huge amounts of multimedia content, including recorded audio and video. However, it is difficult to perform content-based searches of such data compared with text data. Most search engines provide a function for searching data based on metadata such as titles, tags, or text data surrounding the multimedia content. Large vocabulary continuous speech recognition (LVCSR) is one of the most promising technologies for content-based searches of multimedia content including human speech.
Using LVCSR, we can convert speech into text and search the speech-based content using text-based search techniques. Recent text retrieval methods are based on the statistical language model (SLM) [1]. These methods use a mathematical framework rather than heuristics such as tf-idf, and have been shown to be more accurate than the classical heuristic retrieval methods.

However, there are a couple of problems when applying text retrieval methods to transcriptions produced by LVCSR. The first problem is recognition errors: automatic transcriptions generated by a speech recognizer contain many recognition errors, so important words are missing when searching those documents. The other problem is out-of-vocabulary (OOV) words, which are words not included in the dictionary of the speech recognizer. As the recognizer cannot recognize OOV words, those words appearing in the spoken document inevitably become recognition errors. As a result of these two problems, the accuracy of document retrieval for spoken documents using LVCSR is much lower than that for written documents; for research in the field of spoken document retrieval, these problems need to be solved.

Many attempts have been made to solve these problems, such as by using multiple recognition hypotheses [2], topic modeling [3], and document clustering techniques [4]. Although these methods can solve the recognition error problem, they cannot solve the OOV problem because they improve the recognition result only within the vocabulary of the speech recognizer.

To deal with the OOV problem, we are developing a method that acquires new words from the World Wide Web [5]. This approach first extracts keywords from the automatic transcription, and then retrieves Web documents using the extracted keywords. The downloaded documents are used to compensate the index generated from the automatic transcription. A similar approach has also been proposed by Sugimoto et al. [6].
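As an illustration of the SLM retrieval framework mentioned above, the following sketch scores a document by query likelihood with Jelinek-Mercer smoothing against a collection model. The smoothing weight `lam` and the word-list representation are illustrative assumptions, not the exact formulation of [1] or of this paper.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Log query likelihood under a smoothed document language model:
    log P(q|d) = sum_w log[ lam * P_ml(w|d) + (1 - lam) * P(w|C) ].
    Jelinek-Mercer smoothing keeps the score finite for query words
    missing from the document (e.g. due to recognition errors)."""
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    doc_len = len(doc_terms)
    coll_len = len(collection_terms)
    score = 0.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_coll = coll_counts[w] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # term unseen in both document and collection; skip it here
            continue
        score += math.log(p)
    return score
```

Documents are ranked by this score; smoothing is what lets a document still match a query word it does not contain, which is exactly the property the spoken-document setting stresses.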
These Web-based approaches have two problems. The first is that they use only a few keywords as representatives of the spoken document, yet it is difficult to express the features of a document using only a few keywords. The second is that these works use the classical retrieval method based on a vector space model. It is desirable to use a state-of-the-art document retrieval method based on the statistical language model, so as to exploit the advances in information retrieval technology.

In this paper, we propose a spoken document retrieval method based on document expansion using documents downloaded from the Web. There are two novel points in this work. First, we create a database of documents downloaded from the Web so that the database contains as many kinds of words as possible, thus increasing the possibility of acquiring OOV words, and the whole transcription of a spoken document is used for choosing data from the database for document expansion. Second, we use a document retrieval framework based on the SLM. We not only apply the existing method for text retrieval, but also propose a new extension of the SLM so that spoken documents can be retrieved with high accuracy.

This paper is organized as follows. Section 2 briefly describes information retrieval based on statistical language models and the problems of using automatic transcriptions that include recognition errors. In Section 3, we propose a model expansion method using webdata to solve the problems of the language modeling approach. In Section 4, we carry out a retrieval experiment to verify the effectiveness of the proposed method.

Copyright © 2011 ISCA, INTERSPEECH 2011, 28-31 August 2011, Florence, Italy
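The document-expansion idea can be sketched as a mixture of language models: the model estimated from the ASR transcript is interpolated with a model estimated from downloaded Web documents, so that words the recognizer missed (including OOV words) still receive probability mass. The linear mixture and the weights `alpha`/`beta` below are illustrative assumptions for exposition; the smoothing method actually proposed in this paper is developed in Section 3 and may differ.

```python
from collections import Counter

def expanded_doc_lm(transcript_terms, web_terms, collection_terms,
                    alpha=0.6, beta=0.3):
    """Hypothetical expanded document model:
    P(w|d) = alpha * P_ml(w|transcript) + beta * P_ml(w|web)
             + (1 - alpha - beta) * P(w|C).
    A word absent from the transcript (recognition error or OOV)
    can still be matched through the Web-document component."""
    def ml(terms):
        counts, total = Counter(terms), len(terms)
        return lambda w: counts[w] / total if total else 0.0
    p_tr = ml(transcript_terms)
    p_web = ml(web_terms)
    p_coll = ml(collection_terms)
    gamma = 1.0 - alpha - beta
    return lambda w: alpha * p_tr(w) + beta * p_web(w) + gamma * p_coll(w)
```

Because each component is a proper distribution, the mixture remains one, so the expanded model drops straight into the query-likelihood scoring used by SLM retrieval.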