TOPIC IDENTIFICATION FROM AUDIO RECORDINGS
USING WORD AND PHONE RECOGNITION LATTICES
Timothy J. Hazen, Fred Richardson and Anna Margolis
MIT Lincoln Laboratory
Lexington, Massachusetts, USA
ABSTRACT
In this paper, we investigate the problem of topic identification from
audio documents using features extracted from speech recognition
lattices. We are particularly interested in the difficult case where
the training material is minimally annotated with only topic labels.
Under this scenario, the lexical knowledge that is useful for topic
identification may not be available, and automatic methods for ex-
tracting linguistic knowledge useful for distinguishing between top-
ics must be relied upon. Towards this goal we investigate the prob-
lem of topic identification on conversational telephone speech from
the Fisher corpus under a variety of increasingly difficult constraints.
We contrast the performance of systems that have knowledge of the
lexical units present in the audio data, against systems that rely en-
tirely on phonetic processing.
Index Terms— Audio document processing, topic identifica-
tion, topic spotting.
1. INTRODUCTION
As new technologies increase our ability to create, disseminate, and
locate media, the need for automatic processing of these media also
increases. Spoken audio data in particular is a medium that could
benefit greatly from automatic processing. Because audio data is
notoriously difficult to “browse”, automated methods for extract-
ing and distilling useful information from a large collection of audio
documents would enable users to more efficiently locate the specific
content of their interest. One specific task of interest is automatic
topic identification (or topic ID), for which the goal of a system is to
identify the topic(s) of each audio file in its collection. A variant of
the topic identification problem is the topic detection (or topic spot-
ting) problem, for which a system must detect which audio files in
its collection pertain to a specific topic.
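The distinction between the two tasks can be illustrated with a minimal sketch. It assumes that some upstream model has already produced a per-topic score for each audio file; the score values, file names, threshold, and function names below are hypothetical and are not taken from the system described in this paper.

```python
def identify_topic(topic_scores):
    """Topic identification: pick the best-scoring topic for one file."""
    return max(topic_scores, key=topic_scores.get)


def detect_topic(collection_scores, topic, threshold):
    """Topic detection: list the files whose score for one target topic
    meets an (assumed) decision threshold."""
    return [
        name
        for name, topic_scores in collection_scores.items()
        if topic_scores.get(topic, float("-inf")) >= threshold
    ]


# Hypothetical log-scores for two audio files and two topics.
collection = {
    "call_a.wav": {"Movies": -1.0, "Education": -3.0},
    "call_b.wav": {"Movies": -4.0, "Education": -0.5},
}

best_topic = identify_topic(collection["call_a.wav"])
education_files = detect_topic(collection, "Education", threshold=-1.0)
```

Identification makes one decision per file across all topics, whereas detection makes one accept/reject decision per file for a single topic, so its behavior depends on where the threshold is set.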
Topic identification has been widely studied in both the text pro-
cessing and speech processing communities. The most common ap-
proach to topic identification for audio documents is to apply word-
based automatic speech recognition to the audio, and then process
the resulting recognized word strings using traditional text-based
topic identification techniques [1]. This approach has proven to work
effectively for tasks in which reasonably accurate speech recognition
performance is achievable (e.g. news broadcasts) [2]. Of course,
speech recognition errors can degrade topic identification perfor-
mance, and this degradation becomes more severe as the accuracy
of the speech recognizer decreases.
This work was sponsored by the Air Force Research Laboratory under
Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclu-
sions, and recommendations are those of the authors and are not necessarily
endorsed by the United States Government.
Despite previous successes, existing speech recognition systems
may not perform well enough to support accurate topic identification
for some tasks. Two common reasons for the inadequacy of a speech
recognition system are (1) a severe mismatch between the data used
to train the recognizer and the unseen data on which it is applied, and
(2) a dearth of training data that is well matched to the conditions
in which the recognizer is used. One problem that could manifest
itself, for example, is a mismatch between the vocabulary employed
by the recognizer and the topic-specific vocabulary used in the data
of interest. In the most extreme case, a recognition system may not
even be available in the language of the data of interest.
When training a topic identification system, one would ideally
possess a large corpus of transcribed data to help train both a speech
recognition system and a topic identification module. Unfortunately,
manual transcription of data is both costly and time-consuming. To
alleviate this cost, one could resort to a more rapid manual annota-
tion of available data in which audio content is only labeled by topic
and full lexical transcription is not performed. In this case, the
lexical items relevant to topic identification cannot be determined
from manual transcriptions, but must instead be deduced from the
acoustics of the speech signal. Towards
this end, several previous studies have investigated the use of pho-
netic speech recognizers (instead of word recognizers) in the devel-
opment of topic identification systems [3, 4, 5, 6].
In this paper, we empirically contrast topic identification sys-
tems using word-based speech recognition vs. phone-based speech
recognition. Furthermore, we investigate a variety of methods for
improving the performance of both word- and phone-based topic
identification. We begin by investigating a traditional Naïve Bayes
formulation of the problem. Within this formulation we examine
a variety of feature selection techniques required to optimize perfor-
mance of the approach. We also investigate a support vector machine
(SVM) approach which has previously been successfully applied to
the problems of speaker and language identification [7].
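As a point of reference, a generic Naïve Bayes topic classifier over word counts can be sketched as follows. This is a textbook illustration only: the add-one smoothing, the toy documents, and the function names are assumptions made for the example, and it does not include the feature selection or lattice-based counts investigated in this paper.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per topic.

    docs   : list of token lists (hypothetical recognizer output)
    labels : list of topic labels, one per document
    alpha  : additive smoothing constant (an assumption of this sketch)
    """
    vocab = sorted({w for d in docs for w in d})
    topics = sorted(set(labels))
    counts = {t: Counter() for t in topics}
    docs_per_topic = Counter(labels)
    for d, t in zip(docs, labels):
        counts[t].update(d)

    model = {}
    for t in topics:
        total = sum(counts[t].values()) + alpha * len(vocab)
        log_prior = math.log(docs_per_topic[t] / len(docs))
        log_like = {w: math.log((counts[t][w] + alpha) / total) for w in vocab}
        model[t] = (log_prior, log_like)
    return model


def classify(model, doc):
    """Return the topic with the highest posterior log-score.

    Words outside the training vocabulary are ignored.
    """
    def score(topic):
        log_prior, log_like = model[topic]
        return log_prior + sum(log_like[w] for w in doc if w in log_like)

    return max(model, key=score)


# Toy training data; the words and topics are illustrative only.
train_docs = [
    ["movie", "actor", "film"],
    ["film", "director"],
    ["school", "teacher", "class"],
    ["class", "exam"],
]
train_labels = ["Movies", "Movies", "Education", "Education"]

model = train_naive_bayes(train_docs, train_labels)
predicted = classify(model, ["film", "actor"])
```

In this form every vocabulary word contributes to the score, which is one reason feature selection matters: words that occur with similar frequency in all topics add noise without helping to separate them.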
2. EXPERIMENTAL TASK DESCRIPTION
2.1. Corpus
Our experiments use the English Phase 1 portion of the Fisher
Corpus [8, 9]. This corpus consists of 5851
recorded telephone conversations. During data collection, two peo-
ple were connected over the telephone network and given instruc-
tions to discuss a specific topic for 10 minutes. Data was collected
from a set of 40 different topics. The topics were varied and in-
cluded relatively distinct topics (e.g. “Movies”, “Hobbies”, “Edu-
cation”, etc.) as well as topics covering similar subject areas (e.g.
“Issues in Middle East”, “Arms Inspections in Iraq”, “Foreign Re-
lations”). Fixed prompts designed to elicit discussion on the topics
659 978-1-4244-1746-9/07/$25.00 ©2007 IEEE ASRU 2007