2704 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 6, JUNE 2014

Heterogeneity Image Patch Index and Its Application to Consumer Video Summarization

Chinh T. Dang, Student Member, IEEE, and Hayder Radha, Fellow, IEEE

Abstract—Automatic video summarization is indispensable for fast browsing and efficient management of large video libraries. In this paper, we introduce an image feature that we refer to as the heterogeneity image patch (HIP) index. The proposed HIP index provides a new entropy-based measure of the heterogeneity of patches within any picture. By evaluating this index for every frame in a video sequence, we generate a HIP curve for that sequence. We exploit the HIP curve in solving two categories of video summarization applications: key frame extraction and dynamic video skimming. Under the key frame extraction framework, a set of candidate key frames is selected from abundant video frames based on the HIP curve. Then, a proposed patch-based image dissimilarity measure is used to create an affinity matrix of these candidates. Finally, a set of key frames is extracted from the affinity matrix using a min–max based algorithm. Under video skimming, we propose a method to measure the distance between a video and its skimmed representation. The video skimming problem is then mapped into an optimization framework and solved by minimizing a HIP-based distance for a set of extracted excerpts. The HIP framework is pixel-based and does not require semantic information or complex camera motion estimation. Our simulation results are based on experiments performed on consumer videos and are compared with state-of-the-art methods. It is shown that the HIP approach outperforms other leading methods, while maintaining low complexity.

Index Terms—Video summarization, heterogeneity image patch index, the discrete Fréchet distance, consumer videos.

Manuscript received August 30, 2013; revised January 6, 2014 and March 26, 2014; accepted April 15, 2014. Date of publication April 29, 2014; date of current version May 13, 2014. This work was supported in part by the National Science Foundation under Grant CCF-1117709 and in part by the Vietnam Education Foundation Fellowship. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Carlo S. Regazzoni.
The authors are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824-1226 USA (e-mail: dangchin@egr.msu.edu; radha@egr.msu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2320814

I. INTRODUCTION

The massive growth of digital video content demands effective techniques for fast browsing and efficient management of data. Video summarization provides tools for selecting the most informative sequences of still or moving pictures that help users quickly glance through the whole video clip in a constrained amount of time. Generally speaking, there are two categories of video summarization:
• Key frames or static story board: a collection of salient images or key frames extracted from video.
• Dynamic video skimming or a preview sequence: a collection of essential video segments or excerpts (key video excerpts) and the corresponding audio, which are joined together to become a much shorter version of the original video content.

A set of key frames has many important roles in intelligent video management systems such as video retrieval and browsing, navigation, indexing, and prints from video. It helps to reduce computational complexity since the system could work with a set of representative frames instead of the whole video sequence.
Key frames capture both the temporal and spatial information of the video sequence, and hence, they enable rapid viewing [1]. Conventional key frame extraction approaches can be loosely divided into two groups: (i) shot-based and (ii) segment-based. In shot-based key frame extraction, the shots of the original video are first detected, and then one or more key frames are extracted from each shot [2]–[4]. In segment-based key frame extraction approaches, a video is segmented into higher-level video components, where each segment or component could be a scene, an event, a set of one or more shots, or even the entire video sequence. Representative frame(s) from each segment are then selected as the key frames [5], [6].

The second type of video summarization, dynamic video skimming, contains both audio and visual motion elements. Therefore, it is typically more appealing for users than viewing a series of still key frames only. Video skimming, however, is a relatively new research area and normally requires high-level semantic analysis [1]. Approaches to skimming range from a basic extension of key frame extraction (key frames are extracted as an initial step, and each key frame is then treated as the middle frame of a fixed-length excerpt) to more advanced methods such as integrating motion metadata to reconstruct an excerpt [8]. Various features have been extensively used for video skimming generation; these features include text, audio, camera motion, and other visual features such as color histogram, edge, and texture [9]–[11].

The main contributions of this paper include:

1) We propose a new patch-based image/video analysis approach. Using the new model, we create a new feature that we refer to as the heterogeneity image patch (HIP) index of an image or a video frame. The HIP index, which is evaluated using patch-based image/video analysis, provides a measure for the level of heterogeneity (and hence the amount of redundancy) that exists among patches of an image/video frame.
2) By measuring the HIP index for each video frame, we generate a HIP curve that becomes a characteristic curve

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.