324 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 3, MARCH 2007
Semantic Home Photo Categorization
Seungji Yang, Sang-Kyun Kim, and Yong Man Ro, Senior Member, IEEE
Abstract—A semantic categorization method for generic home
photos is proposed. The main contribution of this paper is a
two-layered classification model that incorporates camera
metadata with low-level features for multilabel detection. The
two layers of support vector machine (SVM) classifiers detect
local and global photo semantics in a feed-forward way.
The first layer predicts the likelihood of predefined local photo
semantics based on camera metadata and regional low-level visual
features. In the second layer, one or more global photo semantics
are detected based on these likelihoods. To construct classifiers
that produce a posterior probability, we use a parametric model to
fit the output of the SVM classifiers to a posterior probability. A
concept merging process based on a set of semantic-confidence maps
is also presented to select the more likely photo semantics
on spatially overlapping local regions. Experiments were performed
with 3086 photos from the MPEG-7 visual core experiment
two official databases. The results showed that the proposed method
captures the multiple semantic meanings of home photos much
better than other similar technologies.
Index Terms—Camera metadata, image classification, photo
album, support vector machine.
I. INTRODUCTION
THE GOAL of semantic image categorization is to discover
the semantics of an image from a domain of given
predefined concepts, such as building, waterside, landscape,
cityscape, and so forth. Recently, as it has become affordable to
keep a complete digital record of one’s whole life, the need for
semantic categorization has grown in organizing and managing
personal photo collections, in order to minimize the user’s
manual effort.
Much research has advanced semantic image indexing and
categorization in recent decades [1]–[9]. Most of this work has
focused on reducing the semantic gap between low-level visual
features and high-level semantic descriptions, which are closer
to human visual perception. Here, one primary concern is the
learning approach itself, so that the classifier attains a minimal
error bound in real applications. In particular, statistical learning
approaches, such as the Bayesian probability model [32], Markov
random fields (MRF) [1], and support vector machines (SVM) [2],
have been successfully employed for semantic categorization.
A statistical learning process commonly includes three steps:
1) observing a phenomenon in the real world; 2) constructing a
model of the phenomenon; and 3) making predictions using the
model. A useful approach to building the model of a classifier
is to employ discriminative features besides low-level visual
features, or any combination of both. Smith et al. [7] proposed
semantic image/video indexing using semantic model vectors
constructed from multiple low-level features, where a model
vector stands for a set of numerical degrees of strength in
relation to different semantic meanings. For better classification,
spatial image context features have also been coupled with
low-level features in [2] and [9].
Manuscript received May 6, 2006; revised November 20, 2006. This paper
was recommended by Guest Editor E. Izquierdo.
S. Yang and Y. M. Ro are with the Information and Communications
University (ICU), 103-6 Daejeon, South Korea (e-mail: yangzeno@icu.ac.kr;
yro@icu.ac.kr).
S.-K. Kim is with the Samsung Advanced Institute of Technology (SAIT),
14-1 Gyeonggi, South Korea (e-mail: skkim@sait.samsung.co.kr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2007.890829
Human beings sense many levels of visual semantics in a photo.
However, the semantic labels to be discovered are generally
limited to a specific domain according to the application, because
human semantic knowledge is uncertain and unbounded.
Although semantic object segmentation has been implemented with
a wide range of approaches over the last two decades [33]–[36],
detecting multiple semantic concepts in an image is still a
challenging problem due to low performance and high computational cost.
The problems in semantic categorization can be simplified by
using a multilayered rather than a single-layered approach. Having
multiple layers in classification often helps to solve a classical
image understanding problem that requires effective interaction
between high-level semantics and low-level features. The way human
beings perceive the semantic knowledge of an image is hierarchical.
In other words, human beings first sense rough, rather simple
semantic objects, and then combine them to understand the more
comprehensive and detailed semantic meanings of the image. This
sensory mechanism can be imitated by multilayered learning.
A multilayered approach usually forms a specific hierarchy
of layers, each with one or more classifiers. A classifier in a lower
layer aims to capture simple semantic aspects by using low-level
features, while a classifier in a higher layer interprets more
complex semantic aspects by using high-level semantic features.
Many researchers have applied multilayered approaches
to semantic categorization [5], [6], [9], [13]–[15].
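The layered idea described above can be sketched in a few lines of Python. This is only an illustrative toy, not the method of this paper: the concept names, weights, biases, and threshold are all hypothetical, plain linear scorers stand in for trained SVMs, and a fixed sigmoid stands in for a fitted parametric score-to-probability model. Lower-layer classifiers map low-level features to local concept likelihoods, and higher-layer classifiers take those likelihoods as their input features.

```python
import math

def sigmoid_posterior(score, a=-1.0, b=0.0):
    """Map a raw score to (0, 1). The parameters a and b are illustrative
    defaults (assumption); in practice they would be fitted to data."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def linear_score(weights, bias, features):
    """Stand-in for a trained classifier's decision value (assumption)."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Lower layer: hypothetical local-concept scorers over two low-level features.
LOCAL_MODELS = {
    "sky": ([2.0, -1.0], 0.0),
    "water": ([-1.0, 2.0], 0.0),
}

# Higher layer: hypothetical global-concept scorers whose inputs are the
# lower-layer likelihoods, in the order (sky, water).
GLOBAL_MODELS = {
    "landscape": ([3.0, 0.0], -1.5),
    "waterside": ([0.0, 3.0], -1.5),
}

def classify(features, threshold=0.5):
    """Feed-forward pass: low-level features -> local likelihoods -> global labels."""
    local = [
        sigmoid_posterior(linear_score(w, b, features))
        for w, b in LOCAL_MODELS.values()
    ]
    global_probs = {
        name: sigmoid_posterior(linear_score(w, b, local))
        for name, (w, b) in GLOBAL_MODELS.items()
    }
    # Multilabel decision: keep every global concept above the threshold.
    labels = sorted(n for n, p in global_probs.items() if p >= threshold)
    return local, global_probs, labels

if __name__ == "__main__":
    print(classify([1.0, 0.0])[2])
    print(classify([1.0, 1.0])[2])
```

Because each global concept is thresholded independently, a single input can receive zero, one, or several labels, which is the multilabel behavior a layered scheme is meant to support.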
One state-of-the-art classification method is the SVM [10],
[11]. Many conventional classifiers have targeted empirical risk
minimization (ERM). However, ERM only utilizes the loss function
defined for a classifier and is equivalent to Bayesian decision
theory with a particular choice of prior. Thus, an ERM approach
often leads to an over-fitted classifier, i.e., a classifier that is
adapted too closely to the training data. Unlike ERM, structural
risk minimization (SRM) minimizes the generalization error. The
generalization error is bounded by the sum of the training set error
and a term depending on the Vapnik–Chervonenkis (VC) dimension
of the learning machine. High generalization can be achieved
by minimizing this upper bound. The SVM is based on the idea of
SRM. The generalization error of an SVM is related not to the input
dimensionality of the problem, but to the margin with separating