MAJOR CAST DETECTION IN VIDEO USING BOTH AUDIO AND VISUAL INFORMATION

Zhu Liu
AT&T Labs - Research
Room A5 4F04, 200 Laurel Ave. South, Middletown, NJ 07748
zliu@research.att.com

Yao Wang
Department of Electrical Engineering, Polytechnic University
Brooklyn, NY 11201
yao@vision.poly.edu

ABSTRACT

Major casts, for example the anchor persons or reporters in news broadcast programs and the principal characters in movies, play an important role in video, and their occurrences provide good indices for organizing and presenting video content. This paper describes a new approach for automatically generating the list of major casts in a video sequence based on multiple modalities, specifically both speaker and face information. A list of major casts is created and ordered by the accumulated temporal and spatial presence of the corresponding casts. Preliminary simulation results show that the detected major casts are meaningful and the proposed approach is promising.

1. INTRODUCTION

With the huge amount of video data generated daily, it is indispensable for a video creator or distributor to provide content descriptions for browsing and retrieval. While low-level content descriptors, such as camera shot changes and speech or music boundaries, are useful, they cannot provide semantically meaningful indices. A higher-level, content-based abstract is more desirable to help users grasp the synopsis effectively. Major casts, for example the anchor persons or reporters in news programs and the principal characters in movies, play an important role, and their occurrences provide good indices for organizing and presenting video content. Users may easily digest the main theme of a video by skimming through the clips associated with major casts. Because manual content annotation is time consuming and sometimes inconsistent, many research efforts have been devoted to automating this procedure.
Most previous work focuses on a single modality, e.g., audio or visual information alone, to tackle this problem. Zhang and Kuo [1] classified audio content in a hierarchical way: at the coarse level, audio data is classified into speech, music, environmental sounds, and silence, and at the fine level, environmental sounds are further classified into applause, rain, etc. Rui et al. [2] explored the automatic extraction of video structure from both physical shots and semantic scenes, and developed tools that construct a table of contents (TOC) to assist user access. Since the semantics of video data are embedded in multiple forms that are usually complementary to each other, we need to analyze all available media simultaneously. Saraceno and Leonardi [3] considered segmenting a video into the following basic scene types: dialogs, stories, actions, and generic scenes. This is accomplished by first dividing a video into audio and visual shots independently, and then grouping video shots so that the audio and visual characteristics within each group follow some predefined patterns. Huang et al. [4] proposed to generate a content hierarchy for broadcast news programs by integrating audio, video, and text information simultaneously.

This paper presents a new approach for automatically generating a list of major casts for a video based on both audio and visual information. In Section 2, we illustrate the overall diagram of the major cast detection algorithm. Speaker and face information extraction is described in Section 3. How cues from different modalities are combined to detect major casts is explained in Section 4. In Section 5 we present and discuss some preliminary results, and finally, in Section 6, we draw our conclusion.

(Footnote: The first author performed this work while he was at Polytechnic University. This work was supported in part by the National Science Foundation through its STIMULATE program under Grant No. IRI-9619114.)

2.
MAJOR CAST DETECTION DIAGRAM

[Fig. 1. Major cast detection algorithm: at level 1, the audio track goes through clean speech extraction and speaker segmentation, while the visual track goes through video shot extraction and face detection & tracking; at level 2, the resulting speakers and faces are combined by temporal correlation analysis to generate the major cast list (Cast 1, Cast 2, ...).]

Figure 1 illustrates the proposed major cast detection algorithm. Each major cast is characterized by two attributes: face and speech. The detection procedure finds the corresponding face occurrences and speech segments by analyzing the video at two levels. Audio and visual information is utilized separately at the low level, and at the high level, cues from the different modalities are combined. At the low level, the video sequence is segmented independently in
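The level-2 fusion step can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes the low-level analysis has already produced speaker segments and face tracks as time intervals, associates them by temporal overlap, and ranks candidate casts by accumulated co-occurrence time. All function names and the toy data are hypothetical.

```python
def overlap(a, b):
    """Length of the temporal overlap between intervals a=(start, end) and b=(start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def rank_major_casts(speaker_segments, face_tracks):
    """Associate speakers with faces by temporal correlation.

    speaker_segments: {speaker_id: [(start, end), ...]} from speaker segmentation
    face_tracks:      {face_id: [(start, end), ...]} from face detection & tracking
    Returns ((speaker_id, face_id), total_overlap) pairs, ranked by
    accumulated co-occurrence time (a stand-in for temporal/spatial presence).
    """
    scores = {}
    for spk, segs in speaker_segments.items():
        for face, tracks in face_tracks.items():
            total = sum(overlap(s, t) for s in segs for t in tracks)
            if total > 0:
                scores[(spk, face)] = total
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy example: speaker spk0's speech largely coincides with face0 on screen.
speakers = {"spk0": [(0, 10), (30, 40)], "spk1": [(10, 30)]}
faces = {"face0": [(0, 12), (28, 40)], "face1": [(12, 28)]}
ranking = rank_major_casts(speakers, faces)
# The top-ranked (speaker, face) pairs form the candidate major cast list.
```

A real system would also weight each face track by its spatial extent (face size on screen) and merge duplicate speaker or face clusters before ranking.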