Incorporating Audio Cues into Dialog and Action Scene Extraction

Lei Chen†, Shariq J. Rizvi‡* and M. Tamer Özsu†

† School of Computer Science, University of Waterloo, Waterloo, Canada
Email: {l6chen, tozsu}@uwaterloo.ca
‡ Computer Science and Engineering Department, Indian Institute of Technology, Mumbai, India
Email: rizvi@cse.iitb.ac.in

* Work performed while the author was visiting the University of Waterloo.

ABSTRACT

In this paper, we present an approach to extracting scenes from video. The approach is top-down and uses video editing rules together with audio cues to extract simple dialog and action scenes. The underlying model is a finite state machine coupled with audio cues determined by an audio classifier.

Keywords: shot, scene, editing rules, finite state machine, support vector machine, audio classification

1. INTRODUCTION

The increasing availability and use of video has raised the demand for better video modeling and for more sophisticated indexing and retrieval techniques. However, compared to text or images, video data are much more complex. A one-minute movie clip may contain about 2,000 video frames (images), a mixture of three types of sounds (audio), and several lines of closed captions (text). How to efficiently represent and index video data remains a challenging problem.

Early video database systems segment video into shots [1–3] and extract key frames from each shot to represent it [4–6]. Such systems have been criticized for three reasons:

• The number of shots grows very large with the volume of video data, which makes the data difficult to browse;
• A single shot, produced by one camera operation, does not convey much semantics;
• Using key frames may ignore the temporal characteristics of the video.

There have been several attempts [7–10] to cluster semantically related shots into scenes. All of these scene construction algorithms follow similar steps:

1. Visual features, such as color histograms, textures, and shapes, are extracted from shots.
2. Shots are clustered based on a similarity measure computed from the extracted visual features.
3. Clusters that are temporally close to each other are grouped into scenes.

All of these approaches use a "bottom-up" strategy, clustering shots into "general" scenes without any knowledge of the semantics or structure of the scenes. Because they employ only low-level visual features, semantically unrelated shots may be clustered into one unit merely because they are "similar" in terms of those features. Furthermore, users may not be interested in the "general" scenes constructed in this way, but may instead focus on particular scenes. Dialog and action scenes have special importance in video, since they constitute the basic "sentences" of a movie, which consists of three basic types of scenes [11]: dialogs without action, dialogs with action, and actions without dialog. Automatic extraction of dialog and action scenes from a video is therefore an important issue for practical use of video. There is another shortcoming of
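To make the generic bottom-up pipeline (steps 1–3 above) concrete, the following is a minimal sketch in Python. The histogram-intersection similarity, the greedy threshold clustering, and the temporal-gap parameter are illustrative assumptions for this sketch, not details taken from any of the cited systems; in practice each system uses its own features and clustering strategy.

```python
# Sketch of the "bottom-up" scene construction pipeline:
# (1) features per shot, (2) shot clustering by visual similarity,
# (3) temporal grouping of clusters into scenes.
# All parameter choices here are illustrative assumptions.

def histogram_intersection(h1, h2):
    """Similarity of two normalized color histograms (step 2)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def cluster_shots(histograms, sim_threshold=0.8):
    """Greedily assign each shot to the first cluster whose
    representative histogram is similar enough (step 2)."""
    clusters = []  # each cluster: list of shot indices
    reps = []      # representative histogram per cluster
    for i, h in enumerate(histograms):
        for cluster, rep in zip(clusters, reps):
            if histogram_intersection(h, rep) >= sim_threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
            reps.append(h)
    return clusters

def group_into_scenes(clusters, max_gap=0):
    """Merge clusters whose shot-index spans are temporally close
    or interleaved (step 3); shot indices stand in for time codes."""
    spans = sorted((min(c), max(c)) for c in clusters)
    scenes = [list(spans[0])]
    for start, end in spans[1:]:
        if start - scenes[-1][1] <= max_gap:
            scenes[-1][1] = max(scenes[-1][1], end)
        else:
            scenes.append([start, end])
    return [tuple(s) for s in scenes]

# Toy 3-bin histograms for six shots: shots 0 and 2 look alike,
# 1 and 3 look alike (an interleaved pattern), 4 and 5 look alike.
hists = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10],
         [0.85, 0.10, 0.05], [0.05, 0.90, 0.05],
         [0.10, 0.10, 0.80], [0.15, 0.05, 0.80]]
clusters = cluster_shots(hists)      # → [[0, 2], [1, 3], [4, 5]]
scenes = group_into_scenes(clusters)  # → [(0, 3), (4, 5)]
```

Note how the interleaved clusters [0, 2] and [1, 3] merge into a single scene spanning shots 0–3: this is exactly the behavior that lets low-level similarity group semantically unrelated shots, the weakness discussed above.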