A Framework for Segmentation of Interview Videos

Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah
Computer Vision Lab
School of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816
{ojaved, khan, zrasheed, shah}@cs.ucf.edu

Abstract

In this paper, we present a method to remove commercials from interview videos and to segment the interviews into host and interviewee shots. Our approach relies mainly on information contained in shot transitions, rather than on the scene content of individual frames. We exploit the inherent differences in scene structure between commercials and interviews to differentiate between them. Similarly, we exploit the well-defined structure of interviews to classify shots as questions or answers. The entire show is first segmented into camera shots based on color histograms. We then construct a data structure, the shot connectivity graph, which links similar shots over time. Analysis of the shot connectivity graph lets us automatically separate commercials from program segments: we first detect stories and then assign each story a weight based on its likelihood of being a commercial. Further analysis of the stories distinguishes shots of the interviewer from shots of the interviewees. We have tested our approach on several full-length Larry King shows (including commercials) and achieved video segmentation with high accuracy. The whole scheme is fast and works even on low-quality video (160x120-pixel images at 5 Hz).

Keywords: video segmentation, video processing, digital library, story analysis, semantic structure of video, removing commercials from broadcast video, Larry King Live show

1. Introduction

We live in the digital age. Soon everything from TV shows and movies to documents, maps, books, music, and newspapers will be in digital form.
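The first step of the pipeline outlined in the abstract, segmenting the show into camera shots using color histograms, can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the histogram quantization, the intersection-based distance, and the threshold value are our assumptions.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantized RGB color histogram of a frame, normalized to sum to 1."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3), bins=(bins, bins, bins), range=((0, 256),) * 3
    )
    return hist.ravel() / hist.sum()

def detect_shot_boundaries(frames, threshold=0.5):
    """Mark a shot boundary wherever consecutive frame histograms differ
    by more than `threshold` (1 minus histogram intersection)."""
    boundaries = []
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        # Histogram intersection similarity lies in [0, 1];
        # identical histograms give 1, disjoint ones give 0.
        similarity = np.minimum(prev, cur).sum()
        if 1.0 - similarity > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```

Comparing whole-frame histograms rather than pixel differences makes the cut detector cheap and tolerant of motion within a shot, which matters for the low-resolution, low-frame-rate video the paper targets.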
Storing videos in digital format removes the limitation of sequential access (for example, the forward and rewind buttons on a VCR). Videos can be organized more efficiently for browsing and retrieval by exploiting their semantic structure. Such structure consists of shots and groups of shots called stories. A story is one coherent section of a program or a block of commercials. The ability to segment a video into stories lets the user browse by story structure, rather than rely on the purely sequential access of analog tape.

In this paper, we consider one popular TV show, Larry King Live, which has been running for more than 15 years on CNN. We assume the entire collection of shows has been digitized, and address the problem of how to organize each show so that it is suitable for browsing and retrieval. A user may want to watch only the interview segments without the commercials, view only the clips containing the questions asked during the show, or see only the clips containing the interviewee's answers. For example, a user might watch only the questions to get a summary of the topics discussed in a particular program.

Interview videos are an important segment of news-broadcast networks. Interviews occur both within regular news programs and as stand-alone shows. Many popular prime-time programs rely heavily on the interview format, for example, Crossfire and various talk shows. The algorithm presented in this paper, though tested only on the Larry King Live show, is not specific to any one program and can be applied to these other shows to analyze their structure. This should significantly improve the digital organization of such shows for browsing and retrieval.

There has been much recent interest in video segmentation and the automatic generation of digital libraries.
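The story structure discussed above is recovered in the paper by linking similar shots over time into a shot connectivity graph. The sketch below is one plausible construction, not the paper's exact definition: the histogram-intersection similarity measure, the threshold, and the temporal window size are our illustrative assumptions.

```python
import numpy as np

def build_shot_connectivity_graph(shot_histograms, similarity_thresh=0.8, window=10):
    """Link each shot to earlier, visually similar shots within `window` shots.

    `shot_histograms` is a list of normalized key-frame histograms, one per shot.
    Returns an undirected adjacency list: graph[i] = indices of shots linked to i.
    """
    graph = {i: [] for i in range(len(shot_histograms))}
    for i, hi in enumerate(shot_histograms):
        for j in range(max(0, i - window), i):
            # Histogram intersection similarity in [0, 1].
            if np.minimum(hi, shot_histograms[j]).sum() >= similarity_thresh:
                graph[i].append(j)
                graph[j].append(i)
    return graph
```

In a graph like this, recurring host and guest shots form linked clusters, while commercial shots, which rarely repeat, tend to remain isolated; this is the kind of structural difference the paper's story weighting exploits.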
The Informedia Project [1] at Carnegie Mellon University has spearheaded the effort to segment news broadcasts and automatically generate a database from them every night. The overall system relies on multiple cues, such as video, speech, and closed-captioned text. Alternatively, some approaches rely solely on video cues for segmentation [2, 3, 4]. Such an approach reduces the complexity of the complete algorithm and does not depend on the availability of closed-captioned text for good results. In this paper, we exploit the semantic structure of the show not only to separate the commercials from the interview segments, but also to analyze the content of the show to distinguish host shots from guest shots. All of this is done using only video information, relying mainly on the information contained in shot transitions. No specific training is done for this