Interactive Search in Large Video Collections

Andreas Girgensohn, John Adcock, Matthew Cooper, and Lynn Wilcox
FX Palo Alto Laboratory
3400 Hillview Avenue, Bldg. 4
Palo Alto, CA 94304, USA
{andreasg, adcock, cooper, wilcox}@fxpal.com

ABSTRACT
We present a search interface for large video collections with time-aligned text transcripts. The system is designed for users, such as intelligence analysts, who need to quickly find video clips relevant to a topic expressed in text and images. A key component of the system is a powerful and flexible user interface that incorporates dynamic visualizations of the underlying multimedia objects. The interface displays search results in ranked sets of story keyframe collages and lets users explore the shots in a story. By adapting the keyframe collages based on query relevance and indicating which portions of the video have already been explored, we enable users to quickly find relevant sections. We tested our system as part of the NIST TRECVID interactive search evaluation and found that our user interface enabled users to find more relevant results within the allotted time than users of many systems employing more sophisticated analysis techniques.

Categories & Subject Descriptors: H.5.1 [Information interfaces and presentation]: Multimedia information systems – video.

General Terms: Algorithms; Design; Human Factors.

Keywords: Video search, keyframe collages, text analysis.

INTRODUCTION
While searching text documents is a well-studied process, it is less clear how best to support search in video collections. Text documents can typically be treated as units for the purpose of retrieval; treating whole videos as units, however, will often not lead to satisfactory results. This is the case for news videos, where a 30-minute news program is broken up into stories one or two minutes in length.

Our approach to this problem is to support users in rapidly searching through such video collections. Our target users are analysts who want to combine information from several sources, or video producers who want to locate video for reuse. While the latter will frequently use libraries with extensive metadata to support retrieval, our goal is to support search in video collections where such metadata is not available. We assume that time-aligned text, such as transcripts, automatically recognized speech, or closed captions, is available.

To validate our approach, we participated in this year's interactive search component of a video retrieval evaluation called TRECVID, sponsored by the National Institute of Standards and Technology (NIST) [9]. In the interactive search task, participants have access to four months' worth of broadcast news video from the U.S. ABC and CNN networks (about 60 hours). Participants are asked to answer questions such as "find shots of Bill Clinton speaking with at least part of a US flag visible behind him." Some of the TRECVID participants use very elaborate video analysis techniques to support the search [4]. For example, one very successful system allows the user to search for visual features such as animals, buildings, or people [2].

Our system design philosophy is to automate parts of the system but to let the users directly perform tasks that they can do better. For this application, the system and the users collaborate to improve information retrieval precision and recall. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved. Our system works to maximize recall without compromising precision.
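Formally (the notation here is ours, not from the paper), if S is the set of documents the system retrieves and R is the set of relevant documents in the collection, these two measures are:

\[
\mathrm{precision} = \frac{|S \cap R|}{|S|}, \qquad
\mathrm{recall} = \frac{|S \cap R|}{|R|}
\]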
The users are mostly responsible for precision: they browse through the visually presented candidates and select the truly relevant ones. The system performs a second automation step after the interactive session to supplement the user-selected shots with additional search results that are deemed similar.

Figure 1: The interactive search interface. (A) Story keyframe summaries in the search results. (B) Search text and image entry. (C) TRECVID topic display. (D) Media player and keyframe zoom. (E) Story timeline. (F) Shot keyframes. (G) Relevant shot list.
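The paper does not detail how the second automation step described above works; the following Python sketch shows one plausible form of such a post-session expansion, assuming a shot-level similarity function is available (the names expand_selection, similarity, and the threshold and cap parameters are illustrative, not taken from the system):

```python
from typing import Callable, List, Set

def expand_selection(
    selected: Set[str],
    all_shots: List[str],
    similarity: Callable[[str, str], float],  # assumed to return a score in [0, 1]
    threshold: float = 0.8,
    max_added: int = 20,
) -> List[str]:
    """Supplement user-selected shots with unselected shots that are
    sufficiently similar to at least one selected shot."""
    if not selected:
        return []
    candidates = []
    for shot in all_shots:
        if shot in selected:
            continue
        # Score each unselected shot by its best match against the selection.
        score = max(similarity(shot, s) for s in selected)
        if score >= threshold:
            candidates.append((score, shot))
    # Append the highest-scoring supplements, capped at max_added.
    candidates.sort(key=lambda t: t[0], reverse=True)
    return list(selected) + [shot for _, shot in candidates[:max_added]]
```

In the actual system the similarity judgment would presumably draw on the time-aligned transcript text and keyframe features; a threshold and cap of this kind would keep the automated additions from diluting the precision established by the user during the interactive session.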