2704 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 6, JUNE 2014
Heterogeneity Image Patch Index and Its
Application to Consumer Video Summarization
Chinh T. Dang, Student Member, IEEE, and Hayder Radha, Fellow, IEEE
Abstract—Automatic video summarization is indispensable for
fast browsing and efficient management of large video libraries.
In this paper, we introduce an image feature that we refer to as
the heterogeneity image patch (HIP) index. The proposed HIP index
provides a new entropy-based measure of the heterogeneity of
patches within any picture. By evaluating this index for every
frame in a video sequence, we generate a HIP curve for that
sequence. We exploit the HIP curve in solving two categories
of video summarization applications: key frame extraction and
dynamic video skimming. Under the key frame extraction frame-
work, a set of candidate key frames is selected from abundant
video frames based on the HIP curve. Then, a proposed patch-based
image dissimilarity measure is used to create an affinity matrix
of these candidates. Finally, a set of key frames is extracted
from the affinity matrix using a min–max based algorithm.
Under video skimming, we propose a method to measure the
distance between a video and its skimmed representation. The
video skimming problem is then mapped into an optimization
framework and solved by minimizing a HIP-based distance for a
set of extracted excerpts. The HIP framework is pixel-based and
does not require semantic information or complex camera motion
estimation. Our simulation results are based on experiments
performed on consumer videos and are compared with state-of-
the-art methods. It is shown that the HIP approach outperforms
other leading methods, while maintaining low complexity.
Index Terms— Video summarization, heterogeneity image
patch index, the discrete Fréchet distance, consumer videos.
I. INTRODUCTION
THE massive growth of digital video content demands
effective techniques for fast browsing and efficient man-
agement of data. Video summarization provides tools for
selecting the most informative sequences of still or moving
pictures that help users quickly glance through the whole video
clip in a constrained amount of time. Generally speaking, there
are two categories of video summarization:
• Key frames or static story board: a collection of salient
images or key frames extracted from video.
• Dynamic video skimming or a preview sequence: a collection
of essential video segments or excerpts (key video
excerpts) and the corresponding audio, which are joined
together to become a much shorter version of the original
video content.
Manuscript received August 30, 2013; revised January 6, 2014 and
March 26, 2014; accepted April 15, 2014. Date of publication April 29,
2014; date of current version May 13, 2014. This work was supported in
part by the National Science Foundation under Grant CCF-1117709 and in
part by the Vietnam Education Foundation Fellowship. The associate editor
coordinating the review of this manuscript and approving it for publication
was Prof. Carlo S. Regazzoni.
The authors are with the Department of Electrical and Computer Engineering,
Michigan State University, East Lansing, MI 48824-1226 USA (e-mail:
dangchin@egr.msu.edu; radha@egr.msu.edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2320814
A set of key frames has many important roles in intel-
ligent video management systems such as video retrieval
and browsing, navigation, indexing, and prints from video.
It helps to reduce computational complexity since the system
could work with a set of representative frames instead of
the whole video sequence. Key frames capture both the
temporal and spatial information of the video sequence, and
hence, they enable rapid viewing [1]. Conventional key frame
extraction approaches can be loosely divided into two groups:
(i) shot-based and (ii) segment-based. In shot-based key frame
extraction, the shots of the original video are first detected,
and then one or more key frames are extracted from each shot
[2]–[4]. In segment-based key frame extraction approaches,
a video is segmented into higher-level video components,
where each segment or component could be a scene, an event,
a set of one or more shots, or even the entire video sequence.
Representative frame(s) from each segment are then selected
as the key frames [5], [6].
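As a rough illustration of the shot-based scheme described above, the following minimal sketch detects shots and then picks one frame per shot. Everything concrete in it is an assumption made for illustration and is not taken from [2]–[4]: frames are flat, non-empty lists of 8-bit grayscale values, a shot boundary is declared when the L1 distance between consecutive intensity histograms exceeds a hypothetical `threshold`, and the middle frame of each shot is used as its key frame. The function names are likewise hypothetical.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a flat list of pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [c / total for c in counts]


def detect_shots(frames, threshold=0.5):
    """Split a frame sequence into shots, returned as [start, end) index pairs.

    Assumption: a boundary lies between frames i-1 and i whenever the L1
    distance between their histograms exceeds `threshold`.
    """
    if not frames:
        return []
    shots = []
    start = 0
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if sum(abs(a - b) for a, b in zip(prev, cur)) > threshold:
            shots.append((start, i))
            start = i
        prev = cur
    shots.append((start, len(frames)))
    return shots


def shot_key_frames(frames, threshold=0.5):
    """Pick the (lower) middle frame index of every detected shot."""
    return [(s + e - 1) // 2 for s, e in detect_shots(frames, threshold)]


# Toy video: three dark frames followed by four bright frames.
toy_frames = [[10] * 16] * 3 + [[200] * 16] * 4
key_frame_indices = shot_key_frames(toy_frames)
```

On the toy input the sketch finds two shots, covering frames 0 to 2 and 3 to 6, and returns one key frame index for each. A segment-based method would replace `detect_shots` with a coarser grouping into scenes or events.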
The second type of video summarization, dynamic video
skimming, contains both audio and visual motion elements.
Therefore, it is typically more appealing for users than viewing
a series of still key frames only. Video skimming, however, is
a relatively new research area and normally requires high-level
semantic analysis [1]. Approaches to skimming range
from a basic extension of key frame extraction (key frames are
extracted first, and each is then treated as the middle frame
of a fixed-length excerpt) to more advanced methods such as
integrating motion metadata to reconstruct an excerpt [8]. Various
features have been extensively used for video skimming
generation; these features include text, audio, camera motion,
and other visual features such as color histogram, edge, and
texture [9]–[11].
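The simplest skimming strategy mentioned above, which treats each key frame as the middle frame of a fixed-length excerpt, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `excerpts_from_key_frames`, the half-open `[start, end)` frame ranges, the clamping of excerpts at the video boundaries, and the merging of overlapping excerpts are choices made here, not details given in the text.

```python
def excerpts_from_key_frames(key_frames, excerpt_len, num_frames):
    """Return sorted, merged [start, end) frame ranges, one per key frame.

    Each excerpt has `excerpt_len` frames and is centered on its key frame.
    Assumptions: excerpts are shifted to stay inside the video, and
    overlapping or touching excerpts are merged into one.
    """
    if excerpt_len <= 0 or num_frames <= 0:
        raise ValueError("excerpt_len and num_frames must be positive")
    length = min(excerpt_len, num_frames)
    half = length // 2
    ranges = []
    for k in sorted(set(key_frames)):
        if not 0 <= k < num_frames:
            raise ValueError(f"key frame {k} is outside the video")
        # Center the window on k, then shift it back inside the video.
        start = min(max(k - half, 0), num_frames - length)
        end = start + length
        if ranges and start <= ranges[-1][1]:
            # Overlaps or touches the previous excerpt: merge the two.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


# Two well-separated key frames in a 100-frame video, 9-frame excerpts.
skim = excerpts_from_key_frames([10, 50], excerpt_len=9, num_frames=100)
```

For the example call, the excerpts cover frames 6 to 14 and 46 to 54, so each key frame sits in the middle of its excerpt. The corresponding audio would be cut over the same ranges.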
The main contributions of this paper include:
1) We propose a new patch-based image/video analysis
approach. Using the new model, we create a new
feature that we refer to as the heterogeneity image
patch (HIP) index of an image or a video frame.
The HIP index, which is evaluated using patch-based
image/video analysis, provides a measure for the level
of heterogeneity (and hence the amount of redun-
dancy) that exists among patches of an image/video
frame.
2) By measuring the HIP index for each video frame, we
generate a HIP curve that becomes a characteristic curve