324 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 3, MARCH 2007 Semantic Home Photo Categorization Seungji Yang, Sang-Kyun Kim, and Yong Man Ro, Senior Member, IEEE Abstract—A semantic categorization method for generic home photo is proposed. The main contribution of this paper is to exploit a two-layered classiﬁcation model incorporating camera metadata with low-level features for multilabel detection. The two-layered support vector machine (SVM) classiﬁers operate to detect local and global photo semantics in a feed-forward way. The ﬁrst layer aims to predict likelihood of predeﬁned local photo semantics based on camera metadata and regional low-level visual features. In the second layer, one or more global photo semantics is detected based on the likelihood. To construct classiﬁers pro- ducing a posterior probability, we use a parametric model to ﬁt the output of SVM classiﬁers to posterior probability. A concept merging process based on a set of semantic-conﬁdence maps is also presented to cope with selecting more likelihood photo semantics on spatially overlapping local regions. Experiment was performed with 3086 photos that come from MPEG-7 visual core experiment two ofﬁcial databases. Results showed that the proposed method would much better capture multiple semantic meanings of home photos, compared to other similar technologies. Index Terms—Camera metadata, image classiﬁcation, photo album, support vector machine. I. INTRODUCTION T HE GOAL of semantic image categorization is to dis- cover the image semantics from a domain of some given predeﬁned concepts, such as building, waterside, landscape, cityscape, and so forth. Recently, as it is affordable to keep a complete digital record of one’s whole life, the need for semantic categorization has been raised in both organizing and managing personal photo collection for minimizing user’s manual efforts. Conventionally, many researches have advanced semantic image indexing and categorization in recent decades [1]–[9]. They mostly focused on reducing the semantic gap between low-level visual features and high-level semantic descrip- tions, which are closer to human visual perception. Herein, one primary tackling point is learning approach itself so that classiﬁer realizes minimal bound of error in real application. In particular, statistical learning approaches, such as Bayesian probability model [32], Markov random ﬁelds (MRF) [1], and support vector machines (SVM) [2], have been successfully employed for semantic categorization. A statistical learning Manuscript received May 6, 2006; revised November 20, 2006. This paper was recommended by Guest Editor E. Izquierdo. S. Yang and Y. M. Ro are with the Information and Communications University (ICU), 103-6 Daejeon, South Korea (e-mail: yangzeno@icu.ac.kr; yro@icu.ac.kr). S.-K. Kim is with the Samsung Advanced Institute of Technology (SAIT), 14-1 Gyeonggi, South Korea (e-mail: skkim@sait.samsung.co.kr). Color versions of one or more of the ﬁgures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identiﬁer 10.1109/TCSVT.2007.890829 process commonly includes three steps: 1) observing a phe- nomenon in the real world; 2) constructing a model of the phenomenon; and 3) making predictions using the model, step by step. A useful approach to build the model of a classiﬁer is to employ discriminative features besides low-level visual features or any combination of both. J. Smith et al., in [7], has proposed semantic image/video indexing using semantic model vectors that are constructed from multiple low-level features where the model vector stands for a set of numerical degrees of strength in relation to different semantic meanings. For better classiﬁcation, in [2] and [9], spatial image context features have been coupled with low-level features as well. Human beings sense many levels of visual semantics in photo. However, semantic labels to be discovered are generally lim- ited in a speciﬁc domain according to application, due to uncer- tainty and inﬁnity of semantic knowledge of human beings. Al- though semantic object segmentation has been implemented by a wide range of approaches for last two decades [33]–[36], how to detect multiple semantic concepts in image is still challenging problem due to low performance and high computational cost. The problems in semantic categorization can be simpliﬁed by using multilayered rather than single-layered approach. Having multiple layers in classiﬁcation often help to solve a classical image understanding problem that requires effective interaction of high-level semantics and low-level features. The way human beings perceive semantic knowledge of an image is hierarchical. In other words, human beings ﬁrstly sense rough, rather simple semantic objects, and then compound them to understand more comprehensively detailed semantic meanings of the image. This sensory mechanism can be imitated by a multilayered learning way. Multilayered approach usually forms a speciﬁc hierarchy of layers with one or more classiﬁers. A classiﬁer in the lower layer aims to capture simple semantic aspects by using low-level features while a classiﬁer in the higher layer interprets more complex semantic aspects by using high-level semantic features. Many researchers have employed the multilayered approaches to semantic categorization [5], [6], [9], [13]–[15]. One state-of-the-art classiﬁcation method is using SVM [10], [11]. Many conventional classiﬁers have targeted empirical risk minimization (ERM). But, ERM only utilizes the loss function deﬁned for a classiﬁer and is equivalent to Bayesian decision theory with a particular choice of prior. Thus, an ERM approach often leads to an over-ﬁtted classiﬁer, i.e., classiﬁer is usually too much adapted only to training data. Unlike ERM, structural risk minimization (SRM) minimizes generalization error. The generalization error is bounded by the sum of training set error and a term depending on Vapnik–Chervonenkis (VC) dimension of the learning machine. High generalization can be archived by minimizing the upper bound. SVM is based on the idea of SRM. The generalization error of SVM is related not to the input dimensionality of the problem, but to the margin with separating 1051-8215/$25.00 © 2007 IEEE