OPTIMAL SUBSET SELECTION FROM TEXT DATABASES
Jilei Tian, Jani Nurminen and Imre Kiss
Multimedia Technologies Laboratory
Nokia Research Center, Tampere, Finland
{jilei.tian, jani.k.nurminen, imre.kiss}@nokia.com
ABSTRACT
Speech and language processing techniques, such as automatic
speech recognition (ASR), text-to-speech (TTS) synthesis,
language understanding and translation, will play a key role in
tomorrow’s user interfaces. Many of these techniques employ
models that must be trained using text data. In this paper, we
introduce a novel method for training set selection from text
databases. The quality of the training subset is ensured using
an objective function that effectively describes the coverage
achieved with the strings in the subset. The validity of the
subset selection technique is verified in an automatic
syllabification task. The results clearly indicate that the
proposed systematic selection approach maximizes the quality
of the training set, which in turn improves the quality of the
trained model. The presented idea can be used in a wide
variety of language processing applications that require
training with text databases.
1 INTRODUCTION
Most automatic speech recognition (ASR) and text-to-speech
(TTS) systems contain models that have to be trained with text
data. Typical examples can be found in many parts of these
systems. In pronunciation modeling, data-driven
approaches, such as neural network based methods or decision
tree based methods [6], are often applied, especially for
languages like English. These statistical models are trained
using a pronunciation dictionary containing grapheme-to-
phoneme entries. In text-based language identification [8], the
model is trained using a multilingual text corpus that consists
of word entries from the target languages. In the data-driven
syllabification task [7], the model is trained using text-based
pronunciations and the corresponding syllable structures.
In all data-driven approaches, the selection of a suitable
training set can be regarded as a very important step in the
training process. In general, the performance of any trained
model depends quite strongly on the quality of the text data
used in the training. With text-based data, the importance of
training set selection is particularly pronounced, since the
generation of training data entries is often time- and
resource-consuming and requires language-specific skills. In
this paper, we show that systematic training set selection
results in enhanced model performance and/or offers the
possibility to use a smaller training set size. In practice, the
reduced training set size brings two significant additional
benefits. First, the amount of manual annotation work is
reduced, which in turn decreases the probability of errors and
inconsistencies in the annotations. Second, the memory
consumption and the computational load caused by the
training process are lowered. In some cases this advantage
propagates to the trained model as well; the size of a decision
tree model, for example, depends on the size of the training
set.
Despite the evident importance of the training set
selection, this step is often neglected in practice. Usually, the
training set is obtained by collecting a set of random entries
from a larger text database or by decimating a sorted corpus.
The drawback of these solutions is that the amount of
meaningful information in the selected text data set is not
maximized. The random selection method is rather coarse and
does not produce consistent results. The method of decimating
a sorted data corpus, on the other hand, only uses a limited
number of the initial characters of the strings and thus does
not guarantee good performance.
In this paper, we present a method that can quasi-
optimally select a subset from a text database in such a
manner that the text coverage is maximized. To achieve this,
we define an objective function that is optimized in the subset
selection. The objective function measures the “subset
distance” using the generalized Levenshtein distances
between the text strings. This paper also introduces an
algorithm for optimizing the objective function. For practical
applications with large databases, the algorithm can be
modified in order to speed up the processing or to lower the
memory consumption, but the main idea and the objective
function will remain useful in all cases. To demonstrate the
usefulness of the proposed approach, we evaluate it in the
syllabification task.
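The objective function and the optimization algorithm are defined formally in Section 2. As an informal illustration of the underlying idea, the sketch below selects a subset that spreads out over the database in terms of edit distance. The greedy farthest-point strategy and all function names are our own illustrative assumptions, not the exact algorithm of the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard (unweighted) Levenshtein edit distance via
    dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def select_subset(database: list, k: int) -> list:
    """Illustrative greedy selection: repeatedly add the string
    farthest (in edit distance) from the current subset, so the
    selected strings cover the database as evenly as possible."""
    subset = [database[0]]  # arbitrary seed entry
    remaining = set(range(1, len(database)))
    # distance from each remaining string to the current subset
    dist = {i: levenshtein(database[i], subset[0]) for i in remaining}
    while len(subset) < k and remaining:
        far = max(remaining, key=dist.__getitem__)
        remaining.discard(far)
        subset.append(database[far])
        # update each remaining string's distance to the grown subset
        for i in remaining:
            dist[i] = min(dist[i], levenshtein(database[i], database[far]))
    return subset
```

A weighted (generalized) Levenshtein distance, as used in the paper, would replace the unit costs above with symbol-pair-specific insertion, deletion, and substitution costs.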
The text subset selection method introduced in this paper
can be used in a wide variety of different applications. One
good example is the language identification task [8], in which
the proposed approach makes it possible to easily balance the
number of training set entries from each target language while
at the same time giving a good coverage for every target
language. In addition to the training set selection task
discussed extensively in this paper, it is possible to employ
the same techniques for clustering a text database. Moreover,
when used together with a meaningful distance measure, such
as the generalized Levenshtein distance, the proposed
approach enables the use of vector quantization techniques on
text data.
The remainder of the paper is organized as follows. We
first describe the generalized Levenshtein distance and
introduce the basic principles of the text database selection
algorithm in Section 2. In Section 3, we describe the
syllabification task used as the practical example by briefly
reviewing the syllable structure grammar and the neural
network based syllabification method. The performance of the
proposed subset selection approach is evaluated in the
I - 305 0-7803-8874-7/05/$20.00 ©2005 IEEE ICASSP 2005