VIDEO INDEXING THROUGH INTEGRATION OF SYNTACTIC AND SEMANTIC FEATURES*

Bilge Günsel, A. Müfit Ferman, and A. Murat Tekalp
Dept. of Electrical Engineering and Center for Electronic Imaging Systems, University of Rochester, Rochester, NY 14627
E-mail: {gunsel, ferman, tekalp}@ee.rochester.edu

Abstract

This paper proposes a content-based video indexing system which provides the functionalities necessary for automatic management of video data through integration of syntactic and semantic features. The proposed system has been applied to detection, classification, and then indexing of news programs collected from different TV channels. Although the paper focuses on news programs, the same methods can be used for content-based indexing and search of other TV programs with a distinct semantic structure.

1 Introduction

Content-based access to data has evolved into a fundamental requirement of all networked multimedia applications, including video-on-demand, news-on-demand, and interactive video. The ultimate goal of this work is to create an automatic system that indexes TV programs and allows selective retrieval of news items by content-based queries. Early video- or news-on-demand systems mostly employed manual partitioning of video into clips, where keywords or text are associated with each clip for indexing [1], [2]. Although useful for some purposes, keyword-based techniques are generally agreed to be unable to always adequately represent the semantic information in video. Recent research on automatic, content-based representation deals with segmentation of video into shots, mostly using “syntactic” (bottom-up) methods [3], [4], [5], [6]; some of these efforts use “semantic” (domain-dependent) methods as well [3].
One of the first syntactic approaches, proposed by Swanberg et al. [6], extracts two basic index units of a news program (news items and camera shots) based on the news episode, then classifies each item into a coarse category, and specifies key frames to represent the visual content of each news item; however, no experimental results have been reported. A more recent study on knowledge-guided video content parsing algorithms for indexing news programs was published by Zhang et al. [3], [5]. Their model first segments the video data into shots and locates candidate anchorperson shots. These candidate shots are then classified by region-based model matching. They have proposed [5] the use of two thresholds and the twin-comparison method to detect cuts and gradual transitions from color histogram differences between successive frames. This scene change detection method is efficient, but it requires a second pass over the sequence, which increases the computational burden, and the selection of appropriate thresholds is usually application-dependent. The selection of candidate anchorperson shots also depends on the thresholds, and localizing the news programs within the video stream takes a long time, since each step has a high computational complexity.

*This work is supported by a National Science Foundation SIUCRC grant and a New York State Science and Technology Foundation grant to the Center for Electronic Imaging Systems at the University of Rochester.

This paper offers a practical solution to automatic temporal video segmentation and indexing by integration of syntactic and semantic techniques for news-on-demand. A TV news program is a good example of video with a distinct structural model; the temporal syntax (episode) of a news video is usually very straightforward: a sequence of news items interleaved with commercials. Each news item includes anchorperson shot(s) at its beginning and/or end, followed by relevant news footage.
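To make the episode model above concrete, the following is a minimal sketch of how already-classified shots could be grouped into news items. It is an illustration only, not the system described in this paper: the label names, the NewsItem type, and the function group_news_items are hypothetical. It also assumes the simplest case, in which anchorperson shots open a news item and commercials close it; anchorperson shots at the end of an item are not handled.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical shot labels, assumed to come from some earlier classifier.
ANCHOR, FOOTAGE, COMMERCIAL = "anchor", "footage", "commercial"


@dataclass
class NewsItem:
    # Indices of the shots that belong to this news item.
    shots: List[int] = field(default_factory=list)


def group_news_items(labels: List[str]) -> List[NewsItem]:
    """Group a sequence of shot labels into news items.

    Assumptions (not taken from the paper):
    - an anchorperson shot that follows footage or a commercial opens a new item;
    - consecutive anchorperson shots stay in the same item;
    - a commercial closes the current item and is not indexed.
    """
    items: List[NewsItem] = []
    current = None
    previous = None
    for index, label in enumerate(labels):
        if label == COMMERCIAL:
            current = None
        elif label == ANCHOR:
            if current is None or previous != ANCHOR:
                current = NewsItem()
                items.append(current)
            current.shots.append(index)
        elif label == FOOTAGE:
            if current is None:
                # Footage without a preceding anchor shot still starts an item.
                current = NewsItem()
                items.append(current)
            current.shots.append(index)
        else:
            raise ValueError(f"unknown shot label: {label!r}")
        previous = label
    return items


if __name__ == "__main__":
    demo = [ANCHOR, FOOTAGE, FOOTAGE, COMMERCIAL,
            ANCHOR, ANCHOR, FOOTAGE, ANCHOR, FOOTAGE]
    for item in group_news_items(demo):
        print(item.shots)
```

With such a grouping, a query like “skip to next news item” reduces to moving to the first shot index of the following NewsItem.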
We propose a video indexing system which consists of three processing modules: temporal segmentation, classification, and indexing, described in Sections 2, 3, and 4, respectively. The input of the system is digitized video data, while the output contains key frames representing news units, including segmented and classified news shots. Along with this content-based video data, metadata [7] is also provided at the output. The proposed system aims to answer such queries as “find news programs in a TV broadcast video,” “skip to next news item,” “find the weather report,” “separate all com-
0-8186-7620-5/96 $5.00 © 1996 IEEE