Language model expansion using webdata
for spoken document retrieval
Ryo Masumura, Seongjun Hahm, Akinori Ito
Graduate School of Engineering, Tohoku University
{ryo77373,branden65,aito}@spcom.ecei.tohoku.ac.jp
Abstract
In recent years, there has been increasing demand for ad hoc
retrieval of spoken documents. We can use existing text re-
trieval methods by transcribing spoken documents into text
data using a Large Vocabulary Continuous Speech Recognizer
(LVCSR). However, retrieval performance is severely dete-
riorated by recognition errors and out-of-vocabulary (OOV)
words. To solve these problems, we previously proposed an
expansion method that compensates for errors in the transcription
using text data downloaded from the Web. In this paper, we introduce
two improvements to the existing document expansion frame-
work. First, we use a large-scale sample database of webdata as
the source of relevant documents, thus avoiding the bias intro-
duced by choosing keywords in the existing methods. Next, we
use a document retrieval method based on a statistical language
model (SLM), which is a popular framework in information re-
trieval, and also propose a new smoothing method considering
recognition errors and missing keywords. Retrieval experiments
show that the proposed methods yield good results.
Index Terms: Spoken document retrieval, statistical language
models, World Wide Web
1. Introduction
With the development of information and communication tech-
nology, we can now access huge amounts of multimedia con-
tents including recorded audio and video. However, it is difficult
to perform content-based searches of such data compared
with text data. Most search engines provide a function for
searching data based on metadata such as titles, tags or text data
surrounding the multimedia contents. Large vocabulary contin-
uous speech recognition (LVCSR) is one of the most promising
technologies for content-based searches of multimedia contents
including human speech. Using LVCSR, we can convert speech
into text and search the speech-based content using text-based
search techniques.
Recent text retrieval methods are based on the statistical
language model (SLM) [1]. These methods use a mathematical
framework rather than heuristics such as tf-idf, and have been
shown to be more accurate than classical heuristic retrieval
methods.
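The query-likelihood formulation behind SLM-based retrieval can be illustrated with a minimal sketch. This is not the exact model of [1]; the Jelinek-Mercer smoothing, the interpolation weight, and the toy documents are all assumptions chosen for illustration:

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram model: word -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_log_likelihood(query, doc_lm, coll_lm, lam=0.5):
    """Score a document by log P(query | document model), smoothing
    the document model against the collection model (Jelinek-Mercer)."""
    score = 0.0
    for w in query:
        p = lam * doc_lm.get(w, 0.0) + (1 - lam) * coll_lm.get(w, 1e-9)
        score += math.log(p)
    return score

# Toy collection: two hypothetical "documents".
docs = {
    "d1": "speech recognition converts speech into text".split(),
    "d2": "web search engines index text documents".split(),
}
coll_lm = unigram_lm([w for d in docs.values() for w in d])
doc_lms = {name: unigram_lm(d) for name, d in docs.items()}

query = "speech recognition".split()
ranked = sorted(doc_lms,
                key=lambda n: query_log_likelihood(query, doc_lms[n], coll_lm),
                reverse=True)
print(ranked[0])  # d1, which actually contains the query words, ranks first
```

Smoothing is what makes the approach workable in practice: without it, a single query word absent from a document drives the log-likelihood to minus infinity, which is also why recognition errors and OOV words (discussed next) hurt this framework so directly.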
However, there are a couple of problems when applying
text retrieval methods to transcriptions produced by LVCSR.
The first problem is recognition errors: automatic transcrip-
tions generated by a speech recognizer contain many recogni-
tion errors, and so important words are missing when search-
ing those documents. The other problem is out-of-vocabulary
(OOV) words, which are words not included in the dictionary
of the speech recognizer. As the recognizer cannot recognize
OOV words, those words appearing in the spoken document in-
evitably become recognition errors. As a result of these two
problems, the accuracy of document retrieval for spoken doc-
uments using LVCSR is much lower than that for written doc-
uments; for research in the field of spoken document retrieval,
these problems need to be solved.
Many attempts have been made to solve these problems,
such as by using multiple recognition hypotheses [2], topic
modeling [3], and document clustering techniques [4]. Al-
though these methods can solve the recognition error problem,
they cannot solve the OOV problem because they improve the
recognition result within the vocabulary of the speech recog-
nizer.
To deal with the OOV problem, we are developing a method
that acquires new words from the World Wide Web [5]. This
approach first extracts keywords from the automatic transcrip-
tion, and then retrieves Web documents using the extracted key-
words. The downloaded documents are then used to compensate for
errors in the index generated from the automatic transcription. A similar
approach has also been proposed by Sugimoto et al. [6].
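As a rough illustration of the keyword-extraction step in such approaches, one could score words in the transcription by tf-idf against a background corpus and keep the top few as Web query terms. The scoring formula and the background counts below are hypothetical, not the exact criterion used in [5] or [6]:

```python
import math
from collections import Counter

def top_keywords(transcript, background_df, n_docs, k=3):
    """Return the k highest tf-idf words from a transcription.
    background_df maps word -> document frequency in a background corpus
    of n_docs documents; +1 in the idf denominator avoids division by zero."""
    tf = Counter(transcript)
    scored = {
        w: c * math.log(n_docs / (1 + background_df.get(w, 0)))
        for w, c in tf.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Hypothetical background document frequencies out of 100 documents.
background_df = {"the": 90, "of": 85, "speech": 5, "retrieval": 2}
transcript = "the retrieval of speech documents the the".split()
kw = top_keywords(transcript, background_df, n_docs=100)
print(kw)  # rare content words win; frequent function words like "the" are excluded
```

The weakness the next paragraph points out is visible even here: whatever the scoring, only a handful of words survive the cut, so most of the document's content never reaches the Web query.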
These Web-based approaches have two problems. The first
is that these methods use only a few
keywords as representatives of the spoken document, yet it is
difficult to express the features of a document using only a few
keywords. The second problem is that these works use the clas-
sical retrieval method based on a vector space model. It is desir-
able to use the state-of-the-art document retrieval method based
on the statistical language model for exploiting the advances in
information retrieval technology.
In this paper, we propose a spoken document retrieval
method based on document expansion using documents down-
loaded from the Web. There are two novel points in this work.
First, we create a database of documents downloaded from the Web so that the
database contains as many kinds of words as possible, thus in-
creasing the possibility of acquiring OOV words, and the whole
transcription of a spoken document is used for choosing data
from the database for document expansion. Second, we use a
document retrieval framework based on the SLM. We not only
apply the existing method for text retrieval, but also propose
a new extension of the SLM so that spoken documents can be
retrieved with high accuracy.
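One way to use the whole transcription, rather than a few keywords, for choosing expansion data is to rank the database documents by a simple bag-of-words similarity to the transcription. This is an illustrative sketch under that assumption, not the exact selection criterion developed in Section 3:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_expansion_docs(transcript, database, top_n=2):
    """Rank database documents against the full transcription and
    return the names of the top_n most similar ones for expansion."""
    t = Counter(transcript)
    ranked = sorted(database.items(),
                    key=lambda kv: cosine(t, Counter(kv[1])),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]

# Hypothetical webdata sample database.
db = {
    "w1": "neural speech recognition lecture transcript".split(),
    "w2": "cooking recipes and kitchen tips".split(),
    "w3": "speech retrieval of spoken documents".split(),
}
chosen = select_expansion_docs("retrieval of spoken lecture speech".split(), db)
print(chosen)
```

Because every word of the (possibly erroneous) transcription contributes to the similarity, the selection does not hinge on a few keywords being recognized correctly, which is the bias this paper aims to avoid.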
This paper is organized as follows. Section 2 briefly de-
scribes information retrieval based on statistical language mod-
els and the problems of using automatic transcriptions including
recognition errors. In Section 3, we propose a model expansion
method using webdata to solve the problems of the language
modeling approach. In Section 4, we carry out a retrieval ex-
periment to verify the effectiveness of the proposed method.
Copyright © 2011 ISCA 28 - 31 August 2011, Florence, Italy
INTERSPEECH 2011