TOPIC IDENTIFICATION FROM AUDIO RECORDINGS USING WORD AND PHONE RECOGNITION LATTICES

Timothy J. Hazen, Fred Richardson and Anna Margolis

MIT Lincoln Laboratory
Lexington, Massachusetts, USA

ABSTRACT

In this paper, we investigate the problem of topic identification from audio documents using features extracted from speech recognition lattices. We are particularly interested in the difficult case where the training material is minimally annotated with only topic labels. Under this scenario, the lexical knowledge that is useful for topic identification may not be available, and automatic methods for extracting linguistic knowledge useful for distinguishing between topics must be relied upon. Towards this goal we investigate the problem of topic identification on conversational telephone speech from the Fisher corpus under a variety of increasingly difficult constraints. We contrast the performance of systems that have knowledge of the lexical units present in the audio data against systems that rely entirely on phonetic processing.

Index Terms: Audio document processing, topic identification, topic spotting.

1. INTRODUCTION

As new technologies increase our ability to create, disseminate, and locate media, the need for automatic processing of these media also increases. Spoken audio data in particular is a medium that could benefit greatly from automatic processing. Because audio data is notoriously difficult to “browse”, automated methods for extracting and distilling useful information from a large collection of audio documents would enable users to more efficiently locate the specific content of their interest. One specific task of interest is automatic topic identification (or topic ID), for which the goal of a system is to identify the topic(s) of each audio file in its collection.
A variant of the topic identification problem is the topic detection (or topic spotting) problem, for which a system must detect which audio files in its collection pertain to a specific topic.

Topic identification has been widely studied in both the text processing and speech processing communities. The most common approach to topic identification for audio documents is to apply word-based automatic speech recognition to the audio, and then process the resulting recognized word strings using traditional text-based topic identification techniques [1]. This approach has proven to work effectively for tasks in which reasonably accurate speech recognition performance is achievable (e.g. news broadcasts) [2]. Of course, speech recognition errors can degrade topic identification performance, and this degradation becomes more severe as the accuracy of the speech recognizer decreases.

This work was sponsored by the Air Force Research Laboratory under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Despite previous successes, existing speech recognition systems may not perform well enough to support accurate topic identification for some tasks. Two common reasons for the inadequacy of a speech recognition system are (1) a severe mismatch between the data used to train the recognizer and the unseen data on which it is applied, and (2) a dearth of training data that is well matched to the conditions in which the recognizer is used. One problem that could manifest itself, for example, is a mismatch between the vocabulary employed by the recognizer and the topic-specific vocabulary used in the data of interest. In the most extreme case, a recognition system may not even be available in the language of the data of interest.
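
As a rough illustration of the word-based pipeline described above, in which recognized word strings are passed to a text-based topic classifier, the following Python sketch trains a multinomial Naive Bayes classifier with add-one smoothing on bag-of-words counts. It is not the formulation evaluated in this paper: the transcripts are invented toy strings standing in for recognizer output, and the function names `train_naive_bayes` and `classify` and the smoothing constant `alpha` are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """Train a multinomial Naive Bayes model.

    docs: list of (topic, list_of_words) pairs.
    alpha: additive smoothing constant (assumed value, not from the paper).
    Returns (log_prior, log_likelihood, vocab).
    """
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, words in docs:
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)

    total_docs = sum(topic_counts.values())
    log_prior = {t: math.log(n / total_docs) for t, n in topic_counts.items()}

    log_like = {}
    for t in topic_counts:
        # Smoothed estimate of P(word | topic) over the shared vocabulary.
        denom = sum(word_counts[t].values()) + alpha * len(vocab)
        log_like[t] = {
            w: math.log((word_counts[t][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_like, vocab


def classify(words, model):
    """Return the topic with the highest posterior log score.

    Words outside the training vocabulary are ignored.
    """
    log_prior, log_like, vocab = model
    scores = {
        t: log_prior[t] + sum(log_like[t][w] for w in words if w in vocab)
        for t in log_prior
    }
    return max(scores, key=scores.get)


# Toy "recognizer output", invented purely for illustration.
TRAIN = [
    ("Movies", "i saw that film at the theater last week".split()),
    ("Movies", "the actor in that film was great".split()),
    ("Education", "the school needs more teachers and books".split()),
    ("Education", "my kids like their school and teachers".split()),
]

MODEL = train_naive_bayes(TRAIN)
predicted = classify("we went to the theater to see a film".split(), MODEL)
print(predicted)
```

On real recognizer output, word errors and topic words missing from the recognizer vocabulary would corrupt these counts, which is the degradation discussed above.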
When training a topic identification system, one would ideally possess a large corpus of transcribed data to help train both a speech recognition system and a topic identification module. Unfortunately, manual transcription of data is both costly and time-consuming. To alleviate this cost, one could resort to a more rapid manual annotation of available data in which audio content is only labeled by topic and full lexical transcription is not performed. In this case, the relevant lexical items for topic identification cannot be determined from manual transcriptions, but instead must be deduced somehow from the acoustics of the speech signal. Towards this end, several previous studies have investigated the use of phonetic speech recognizers (instead of word recognizers) in the development of topic identification systems [3, 4, 5, 6].

In this paper, we empirically contrast topic identification systems using word-based speech recognition vs. phone-based speech recognition. Furthermore, we investigate a variety of methods for improving the performance of both word- and phone-based topic identification. We begin by investigating a traditional Naïve Bayes formulation of the problem. Within this formulation we examine a variety of feature selection techniques required to optimize performance of the approach. We also investigate a support vector machine (SVM) approach which has previously been successfully applied to the problems of speaker and language identification [7].

2. EXPERIMENTAL TASK DESCRIPTION

2.1. Corpus

For our experiments we used the English Phase 1 portion of the Fisher Corpus [8, 9]. This corpus consists of 5851 recorded telephone conversations. During data collection, two people were connected over the telephone network and given instructions to discuss a specific topic for 10 minutes. Data was collected from a set of 40 different topics.
The topics were varied and included relatively distinct topics (e.g. “Movies”, “Hobbies”, “Education”, etc.) as well as topics covering similar subject areas (e.g. “Issues in Middle East”, “Arms Inspections in Iraq”, “Foreign Relations”). Fixed prompts designed to elicit discussion on the topics

978-1-4244-1746-9/07/$25.00 ©2007 IEEE, ASRU 2007