Speech Retrieval with Video Parsing for Television News Programs

Helen M. Meng¹, Xiaoou Tang², Pui Yu Hui¹, Xinbo Gao² and Yuk Chi Li¹
¹ Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management
² Department of Information Engineering
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
hmmeng@se.cuhk.edu.hk

ABSTRACT

We have been working on speech retrieval from Chinese (Cantonese) television news programs. The use of automatic speech recognition for audio indexing produces imperfect transcriptions, and recognition errors affect retrieval performance. A news story typically contains a brief report by the anchor person(s) in the studio, as well as news footage from the field. Our investigation shows that our recognizer performs better when indexing audio from the studio than when indexing audio from the field. In order to automatically extract the "reliable" audio segments for speech retrieval, we attempt to detect studio-to-field transitions by means of video parsing. Our study is based on 146 news stories collected from local Cantonese television news programs. We formulated a known-item retrieval task and adopted the average inverse rank (AIR) as our evaluation metric. Retrieval is performed based on syllable bigram units, augmented with skipped syllable bigrams. Retrieval using the entire audio track of each news story gave AIR=0.759. With the incorporation of video parsing, we performed retrieval based only on the studio recordings, which produced AIR=0.768.

1. INTRODUCTION

The explosive growth of the Internet has created a rich source of electronic information in a variety of media: text, audio and video. This creates a demand for multilingual and multimedia information retrieval technologies that enable users to retrieve personally relevant content on demand. Text-based search engines are widely used, and audio and video search are active areas of research.
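To make the evaluation setup named in the abstract concrete, the following Python sketch illustrates two of its ideas: indexing units formed from syllable bigrams augmented with skipped bigrams, and the average inverse rank (AIR) for a known-item retrieval task. It is an illustration under stated assumptions and may differ from the system described in this paper. We assume that a skipped bigram pairs two syllables separated by one intervening syllable, and that AIR is the mean over all queries of the reciprocal of the rank at which the target story is retrieved, with a query whose target is not retrieved contributing zero. The function names, the romanized syllables and the story identifiers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def syllable_bigrams(syllables: Sequence[str], max_skip: int = 1) -> List[Tuple[str, str]]:
    """Return adjacent syllable bigrams plus skipped bigrams.

    Assumption: a skipped bigram pairs two syllables that have up to
    `max_skip` syllables between them (max_skip=1 skips one syllable).
    """
    units: List[Tuple[str, str]] = []
    # gap=1 gives ordinary bigrams; gap=2 gives bigrams that skip one syllable.
    for gap in range(1, max_skip + 2):
        for i in range(len(syllables) - gap):
            units.append((syllables[i], syllables[i + gap]))
    return units


def average_inverse_rank(rankings: Dict[str, List[str]], targets: Dict[str, str]) -> float:
    """Mean of 1/rank of the known item over all queries.

    Assumption: a query whose target is missing from its ranked list
    contributes 0 to the sum.
    """
    total = 0.0
    for query_id, ranked_stories in rankings.items():
        target = targets[query_id]
        if target in ranked_stories:
            total += 1.0 / (ranked_stories.index(target) + 1)
    return total / len(rankings)


if __name__ == "__main__":
    # Hypothetical romanized syllables for a short query.
    print(syllable_bigrams(["hoeng", "gong", "san", "man"]))

    # Hypothetical ranked story lists for two known-item queries.
    rankings = {"q1": ["s1", "s2"], "q2": ["s3", "s4", "s2"]}
    targets = {"q1": "s1", "q2": "s2"}
    print(average_inverse_rank(rankings, targets))
```

Under these assumptions, an AIR of 1.0 would mean that every target story is ranked first, so values such as 0.759 and 0.768 indicate that the target story is usually retrieved at or near the top of the ranked list.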
We have been working on the problem of Chinese spoken document retrieval [Meng et al., 2000]. In particular, we work with Cantonese, which is a major dialect of the Chinese language, commonly used in Hong Kong, Macau, South China and many overseas Chinese communities. This work attempts to apply video parsing to assist our Cantonese spoken document retrieval task, based on television news programs. We combine speech recognition and video parsing for indexing our audio tracks, and apply a vector-space model for retrieval. Previous work in this area includes Mandarin (the major dialect of Chinese) spoken document retrieval by [Chien et al., 1999] and [Wang et al., 1999], and the CMU Informedia project [Wactlar et al., 1996], which uses image and audio information concurrently for digital video access.

2. CORPORA

Video content for our experiments is provided by Hong Kong Television Broadcasts Ltd. (TVB). It consists of Cantonese news broadcasts from the Jade channel (i.e. the Cantonese channel), with 146 news stories, for which Table 1 provides some detailed information. (Footnote 1, on the Jade channel: http://www.tvb.com.hk/news)

Language                  Cantonese Chinese
Source                    TVB Jade channel
Number of Stories         146 (~3.11 hours)
Extraction Period         7-9 July 1999 and 5-17 October 2000
Average Length of News    1 min 15.62 sec (per story)
Minimum Length of News    7.13 sec
Maximum Length of News    4 min 0.2 sec
Digital Video Format      MPEG-1

Table 1. Information about the video content used in our experiments.

Figure 1. The temporal structure of a news program.

Each MPEG file contains a single news story manually segmented from the news program, whose temporal structure is illustrated in Figure 1. Each story is accompanied by a brief textual summary and its title. However, the summary is not a verbatim transcription of the audio track of the video file. We estimated that the length of the textual summary is roughly a quarter of that of the audio track, measured in the number of characters / syllables.
(Footnote 2, on characters / syllables: Written Chinese consists of a sequence of characters. Each character is pronounced as a syllable.)

The average length of the summary titles is 17.5 characters. Table 2 shows an example of the textual summary of a news story, together with its title (underlined).

Table 2. An example of the textual summary of a news story. The summary title is underlined.

Very often, the news story begins with a report from the anchor(s) in the studio, followed by a live report from the field. The anchor reports are primarily studio-quality speech in Cantonese. Live reports are mainly spontaneous speech (e.g. from