TOPIC IDENTIFICATION FROM AUDIO RECORDINGS USING WORD AND PHONE RECOGNITION LATTICES

Timothy J. Hazen, Fred Richardson and Anna Margolis

MIT Lincoln Laboratory
Lexington, Massachusetts, USA

ABSTRACT

In this paper, we investigate the problem of topic identification from audio documents using features extracted from speech recognition lattices. We are particularly interested in the difficult case where the training material is minimally annotated with only topic labels. Under this scenario, the lexical knowledge that is useful for topic identification may not be available, and automatic methods for extracting linguistic knowledge useful for distinguishing between topics must be relied upon. Towards this goal we investigate the problem of topic identification on conversational telephone speech from the Fisher corpus under a variety of increasingly difficult constraints. We contrast the performance of systems that have knowledge of the lexical units present in the audio data against systems that rely entirely on phonetic processing.

Index Terms: Audio document processing, topic identification, topic spotting.

1. INTRODUCTION

As new technologies increase our ability to create, disseminate, and locate media, the need for automatic processing of these media also increases. Spoken audio data in particular is a medium that could benefit greatly from automatic processing. Because audio data is notoriously difficult to “browse”, automated methods for extracting and distilling useful information from a large collection of audio documents would enable users to locate the specific content of interest more efficiently. One specific task of interest is automatic topic identification (or topic ID), for which the goal of a system is to identify the topic(s) of each audio file in its collection. A variant of the topic identification problem is the topic detection (or topic spotting) problem, for which a system must detect which audio files in its collection pertain to a specific topic.
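The difference between topic identification and topic detection can be illustrated with a minimal sketch. It assumes that some upstream system has already produced a score for every (audio file, topic) pair. The file names, score values, function names, and the detection threshold below are all hypothetical and chosen only for illustration; they do not come from the systems evaluated in this paper.

```python
from typing import Dict, List

# Hypothetical per-file topic scores (higher means a better match).
# In a real system these would come from a trained topic model.
DOC_SCORES: Dict[str, Dict[str, float]] = {
    "call_001": {"Movies": 0.7, "Hobbies": 0.2, "Education": 0.1},
    "call_002": {"Movies": 0.3, "Hobbies": 0.6, "Education": 0.1},
    "call_003": {"Movies": 0.5, "Hobbies": 0.1, "Education": 0.4},
}


def identify_topic(scores: Dict[str, float]) -> str:
    """Topic identification: pick the single best-scoring topic for one file."""
    return max(scores, key=lambda topic: scores[topic])


def detect_topic(
    doc_scores: Dict[str, Dict[str, float]], topic: str, threshold: float
) -> List[str]:
    """Topic detection: list the files whose score for `topic` meets the threshold."""
    return sorted(
        doc
        for doc, scores in doc_scores.items()
        if scores.get(topic, float("-inf")) >= threshold
    )


# Identification assigns a label to every file in the collection.
identified = {doc: identify_topic(scores) for doc, scores in DOC_SCORES.items()}

# Detection asks, for one target topic, which files pertain to it.
# The threshold of 0.4 is an arbitrary illustrative operating point.
detected_movies = detect_topic(DOC_SCORES, "Movies", threshold=0.4)

print(identified)
print(detected_movies)
```

Identification is a forced choice among all topics for each file, whereas detection is a per-topic accept/reject decision, so its behavior depends on where the threshold is set.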
Topic identification has been widely studied in both the text processing and speech processing communities. The most common approach to topic identification for audio documents is to apply word-based automatic speech recognition to the audio, and then process the resulting recognized word strings using traditional text-based topic identification techniques [1]. This approach has proven to work effectively for tasks in which reasonably accurate speech recognition performance is achievable (e.g. news broadcasts) [2]. Of course, speech recognition errors can degrade topic identification performance, and this degradation becomes more severe as the accuracy of the speech recognizer decreases.

This work was sponsored by the Air Force Research Laboratory under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Despite previous successes, existing speech recognition systems may not perform well enough to support accurate topic identification for some tasks. Two common reasons for the inadequacy of a speech recognition system are (1) a severe mismatch between the data used to train the recognizer and the unseen data on which it is applied, and (2) a dearth of training data that is well matched to the conditions in which the recognizer is used. One problem that could manifest itself, for example, is a mismatch between the vocabulary employed by the recognizer and the topic-specific vocabulary used in the data of interest. In the most extreme case, a recognition system may not even be available in the language of the data of interest.

When training a topic identification system, one would ideally possess a large corpus of transcribed data to help train both a speech recognition system and a topic identification module. Unfortunately, manual transcription of data is both costly and time-consuming.
To alleviate this cost, one could resort to a more rapid manual annotation of available data in which audio content is only labeled by topic and full lexical transcription is not performed. In this case, the lexical items relevant to topic identification cannot be determined from manual transcriptions, but must instead be deduced from the acoustics of the speech signal. Towards this end, several previous studies have investigated the use of phonetic speech recognizers (instead of word recognizers) in the development of topic identification systems [3, 4, 5, 6].

In this paper, we empirically contrast topic identification systems using word-based speech recognition vs. phone-based speech recognition. Furthermore, we investigate a variety of methods for improving the performance of both word- and phone-based topic identification. We begin by investigating a traditional Naïve Bayes formulation of the problem. Within this formulation we examine a variety of feature selection techniques required to optimize performance of the approach. We also investigate a support vector machine (SVM) approach which has previously been successfully applied to the problems of speaker and language identification [7].

2. EXPERIMENTAL TASK DESCRIPTION

2.1. Corpus

Our experiments use the English Phase 1 portion of the Fisher Corpus [8, 9]. This corpus consists of 5851 recorded telephone conversations. During data collection, two people were connected over the telephone network and given instructions to discuss a specific topic for 10 minutes. Data was collected on a set of 40 different topics. The topics were varied and included relatively distinct topics (e.g. “Movies”, “Hobbies”, “Education”, etc.) as well as topics covering similar subject areas (e.g. “Issues in Middle East”, “Arms Inspections in Iraq”, “Foreign Relations”).
978-1-4244-1746-9/07/$25.00 ©2007 IEEE ASRU 2007

Fixed prompts designed to elicit discussion on the topics