Topic Labeling of Broadcast News Stories in the Informedia Digital Video Library

Alexander G. Hauptmann and Danny Lee
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213-3890, USA
Tel: 1-412-268-1448
E-mail: {alex,dlee}@cs.cmu.edu

ABSTRACT
This paper describes the implementation of a topic labeling component for the Informedia Digital Video Library. Each news story recorded from the evening news is assigned to one of 3178 topic categories using a K-nearest neighbor classification algorithm. In preliminary tests, the system achieved recall of 0.491 with relevance of 0.482 when up to 5 topics could be assigned to a news story.

KEYWORDS: Topic detection and labeling, topic spotting and classification, video library, digital libraries, broadcast news story indexing.

INTRODUCTION
The Informedia Digital Library Project [1,2] allows full content indexing and retrieval of text, audio and video material. By integrating technologies from the fields of natural language understanding, image processing, speech recognition and video compression, the Informedia digital video library system allows comprehensive access to multimedia data. News-on-Demand is a particular collection in the Informedia Digital Library that has served as a test-bed for automatic library creation techniques. As of March 1998, the Informedia project had about 1.2 terabytes of news video indexed and accessible online, with 1052 news broadcasts containing 21554 stories.

The Informedia digital video library system has two distinct subsystems: the Library Creation System and the Library Exploration Client. The library creation system runs every night, automatically capturing, processing and adding current news shows to the library. It is during the library creation phase that topics are automatically assigned to incoming news stories.
The user can later browse these stories and topics using the library exploration client.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Digital Libraries 98 Pittsburgh PA USA Copyright ACM 1998 0-89791-965-3/98/6...$5.00

Topics in IDVL
While the original Informedia system allows a search of the full transcript text associated with the audio portion of the video, no attempt had been made until now to classify the news stories into topic categories. Users of the system repeatedly expressed the desire that the large amount of available data be categorized to aid in understanding the corpus and searching it effectively.

Related Research on Topic Detection
The work reported here is similar in spirit to an approach reported by Schwartz [4], who classifies news stories into a static set of topics using a Hidden Markov Model approach and found it to be somewhat better than a naive Bayesian approach. Yang [7] also reports on other techniques, which try to cluster news stories into groups of similar topic content. This work differs in that the topic categories here are defined a priori and do not change with different data sets. We felt this would reflect user needs better than a clustering approach, which could yield different clusters on different days, depending on the contents of the corpus.

DATA
The data for the experiment reported here came from a set of CD-ROMs of broadcast news transcripts, published by Primary Source Media [8]. These data were used for training the system, and a separate held-out set was used for the evaluation results reported below.
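
To make the role of the labeled training transcripts concrete, the following is a minimal sketch of K-nearest neighbor topic assignment with a fixed, a priori label set. It is an illustration under stated assumptions, not the system's implementation: the whitespace tokenization, the cosine similarity over raw term counts, the similarity-weighted voting, the value of k, and the toy stories and labels in toy_training are all hypothetical choices made for this example. Only the use of K-nearest neighbor classification and the limit of up to 5 topics per story come from the text. With k=2, the example query is labeled from the two budget stories, so "politics" (two votes) ranks ahead of "budget" (one vote).

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts (no stemming or stop-word removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_topics(story, training, k=3, max_topics=5):
    """Return up to max_topics labels voted for by the k most similar stories.

    training is a list of (transcript_text, [topic labels]) pairs.
    """
    query = vectorize(story)
    scored = [(cosine(query, vectorize(text)), topics) for text, topics in training]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for similarity, topics in neighbors:
        for topic in topics:
            votes[topic] += similarity  # similarity-weighted vote (an assumption)
    return [topic for topic, score in votes.most_common(max_topics) if score > 0]


# Hypothetical toy training set; real stories carry one or more manual labels.
toy_training = [
    ("the senate passed the budget bill", ["politics", "budget"]),
    ("the president vetoed the budget bill", ["politics"]),
    ("the team won the championship game", ["sports"]),
    ("the striker scored in the final game", ["sports", "soccer"]),
]

print(knn_topics("congress debates the budget bill", toy_training, k=2))
# prints ['politics', 'budget']
```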
The online Informedia system uses actual broadcast video, for which no manual topic labels are available; however, the data are of the same type as on the CD-ROM. From this CD-ROM, we used 34671 news stories from 1995 as training data. Each of the news stories had one or more topic labels associated with it. Of these topic labels, we selected the top 3178 unique topics, each of which occurred at least 10 times in the whole corpus. Topics with fewer instances were viewed as idiosyncratic and ignored in the experiments. For testing the accuracy of the topic assignment, 11811 news stories from 1996, up to April, were used. A typical story is given in the following paragraph: “Gossip columnists in Hong Kong have been deprived of juice for the past three days. Hong Kong's Performing Artists Guild decided