Video Summarization at Brno University of Technology

Vítězslav Beran, Michal Hradiš, Adam Herout, Stanislav Sumec, Igor Potúček, Pavel Zemčík, Josef Mlích, Aleš Láník, Petr Chmelař
Brno University of Technology
Faculty of Information Technology
Department of Computer Graphics and Multimedia
Božetěchova 2, 612 66 Brno, CZ
{beranv, herout, sumec, potucek, zemcik, chmelarp}@fit.vutbr.cz
{xhradi05, xmlich02, xlanik00}@stud.fit.vutbr.cz

ABSTRACT
This paper describes the video summarization system built by the Brno team for the TRECVID 2007 evaluation. The motivation for the system design and its overall structure are described, followed by a more detailed description of the critical parts of the system: feature extraction and clustering of frames (shots, sub-shots) in the time domain. Many ideas were not included in the system because of time constraints. Those considered promising are stated and briefly described as possible future work. The video summarization results presented in this paper can be considered a modest success and may encourage further development in the field, especially since not all of the features and processing methods under consideration were implemented in the evaluated system.

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering

General Terms
Algorithms, Similarity measures.

Keywords
Video, summarization, image features, time compression, TRECVID evaluation.

1. INTRODUCTION
Contemporary technology makes it possible to acquire huge amounts of video content, e.g. from TV broadcasting, meeting rooms, and security systems. Such data can be reused for various purposes. However, searching for the desired information within large video libraries is time-consuming. It becomes necessary to give users summarizing and skimming tools that speed up this process. These tools should produce shortened versions of the source videos while preserving their information content.
Various methods for creating video summaries have already been proposed. One class of techniques is based on time compression: the playback rate of audio and video is sped up with almost no pitch distortion. However, these techniques are limited to a relatively low saving factor of around 1.5–2.5, depending on the speech rate. Slightly better results can be achieved when silent intervals are removed completely. Other techniques generate a static storyboard of images, which are selected according to the information contained in the video or audio tracks.

This paper describes a system for creating video summaries based on the identification of similar clips. The best representative clip from every group is selected and inserted into the final video. Further, the resulting summary is formatted with additional information that helps to locate other occurrences of the presented clip.

2. SYSTEM OVERVIEW
Different purposes of the resulting videos would call for different summarization methods. The presented work targets summarization for professionals who need to deal with a number of relatively long video recordings. The resulting video should therefore cover parts of the original recording, preferably representing all the different kinds of shots. Consequently, the resulting video is not meant to contain the most interesting scenes, the most dynamic ones, or those most closely related to the “story”, etc. The selected approach also does not rely on any understanding of the semantic meaning of the individual shots. Automatic semantic understanding is currently not possible, and the system is supposed to work for unknown videos, where some semi-automatic or guided approach would be possible.

The scheme of the video summarizing system is shown in Figure 1. The input video frames are described using the preferred image features and classified in two ways: shot boundary, and wanted/unwanted frame.
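The idea of grouping similar clips and keeping one representative per group can be illustrated with a minimal sketch. It is not the evaluated system: the feature vectors, the Euclidean distance, the greedy threshold grouping, the medoid rule for choosing the representative, and all function names are assumptions made here for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+


def group_similar_clips(features, threshold):
    """Greedy grouping (an assumed, simplified stand-in for clustering).

    A clip joins the first group whose seed clip (the group's first member)
    lies within `threshold`; otherwise it starts a new group.
    """
    groups = []  # each group is a list of clip indices
    for idx, vec in enumerate(features):
        for group in groups:
            if dist(vec, features[group[0]]) <= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def pick_representative(group, features):
    """Return the medoid: the clip with the smallest total distance
    to all clips in its group (an assumed selection criterion)."""
    return min(
        group,
        key=lambda i: sum(dist(features[i], features[j]) for j in group),
    )


def summarize(features, threshold):
    """Return (representative index, member indices) for every group."""
    groups = group_similar_clips(features, threshold)
    return [(pick_representative(g, features), g) for g in groups]


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors, one per clip.
    clips = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
    for rep, members in summarize(clips, threshold=1.0):
        print(f"representative clip {rep}, group members {members}")
```

Keeping the member list next to each representative is what would allow a summary to point at the other occurrences of the clip being shown.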
The raw video is divided into short shots, which are described and classified in the same way as frames and finally clustered. Representative shots are combined into the final video according to the layout setup.

With the targeted purpose in mind, notable effort was invested in the layout of the resulting video: the output is not simply a sequence of (shortened) shots of the original video; instead, the actual video is played in a window (albeit a large one) and is accompanied by textual