Automated Localization of Affective Objects and Actions in Images Via Caption Text-cum-Eye Gaze Analysis

Subramanian Ramanathan, Harish Katti, Raymond Huang Z.W., Tat-Seng Chua, Mohan Kankanhalli
School of Computing, Department of Psychology
National University of Singapore
{raman,harishk,chuats,mohan}@comp.nus.edu.sg, raymondhuang@nus.edu.sg

ABSTRACT
We propose a novel framework to localize and label affective objects and actions in images through a combination of text, visual and gaze-based analysis. Human gaze provides useful cues for inferring the locations and interactions of affective objects. While the concepts (labels) associated with an image can be determined from its caption, we demonstrate localization of these concepts by learning a statistical affect model for world concepts. The affect model is derived from non-invasively acquired fixation patterns on labeled images, and guides localization of affective objects (faces, reptiles) and actions (look, read) from fixations in unlabeled images. Experimental results obtained on a database of 500 images confirm the effectiveness and promise of the proposed approach.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Multimedia Application
General Terms: Human Factors, Algorithms.
Keywords: Automated localization and labeling, caption text-cum-eye gaze analysis, affect model for world concepts, statistical model.

1. INTRODUCTION
Image understanding remains an unsolved problem, despite the many advances in computer vision. Describing natural images involves automated segmentation and recognition of the various scene objects appearing at multiple scales and orientations, a challenge that has inspired LabelMe [11]. The difficulty of determining image objects (concepts) from visual content alone has led image retrieval algorithms [2] to rely on associated keywords and captions for image search.
Noise associated with text-based image retrieval led to the development of Supervised Multiclass Labeling (SML) [1], which segments and labels unknown images by applying learned knowledge to an extracted 'bag of features'. However, the algorithm requires extensive training and fails to address the semantic gap. An urn model for object recall is used in [12] to establish the importance of certain scene objects, even in simple scenes. Also, observations made from eye-gaze statistics in [5] suggest that humans are attentive to interesting objects in semantically rich photographs.

Eye-gaze measurements have been employed for modeling user attention in a number of applications, including visual search for Human-Computer Interaction (HCI) [7] and open signed video analysis [3]. [9] employs low-level image features (contrast, intensity, etc.) to compute a saliency map that predicts human gaze. However, as noted in [5], objects drive attention in semantically rich images, while low-level saliency contributes only indirectly.

This paper is perhaps most similar to [10], where caption text and image segments are combined to localize the subject of a natural image. Instead, we focus on localizing affective (attention-grabbing, emotion-evoking) concepts in images. Contrary to the notion that human subjectivity influences the choice of interesting scene objects, we observe that affective concepts are consistently fixated upon by a majority of subjects.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM'09, October 19–24, 2009, Beijing, China.
Copyright 2009 ACM 978-1-60558-608-3/09/10 ...$10.00.
These concepts may correspond to individual objects or to interactions between two objects (actions). An affect model for world concepts is derived from fixation patterns on labeled images. The affect model encodes a world ontology as a tree whose vertex weights denote concept affectiveness, and it helps localize the most affective concepts corresponding to the caption of an unlabeled image. Since eye gaze is a strong indicator of visual attention, the proposed affect model can easily be extended to include interesting objects in semantically rich images.

Fig. 1 demonstrates automatic labeling of generic faces using the proposed approach. Labeled images (Figs. 1(a),(b)) are used for learning affective image concepts. Subject fixation patterns for these images, where a fixation is defined as attention around a point for a minimum time period (100 msec in our experiments), are shown in Figs. 1(d),(e). Distinct colors represent fixation patterns of different subjects, numbers denote the sequence of fixations, and circle sizes denote the fixation duration around a point. While the training images include labels like body, grass, etc., we observe a majority of fixations on the face, implying that faces are affective. Moreover, most fixations within the face are observed around the eyes, nose and mouth. Fig. 1(c) is an unlabeled image with known fixation patterns (Fig. 1(f)), whose caption reads 'A cute cat face'.

The hierarchy of affective concepts for Fig. 1(c) is determined through the affect model as face → {nose + mouth, eyes}. Using JSEG segmentation [4] as a guide, recursive fixation clustering is employed for affective concept localization.
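A fixation, as defined above, can be recovered from raw eye-tracker samples by grouping consecutive gaze points that stay within a small spatial window for long enough. The sketch below uses a simple dispersion-threshold test; only the 100 msec minimum duration comes from the paper, while the sample format, function name and dispersion radius are illustrative assumptions, not the authors' pipeline.

```python
def detect_fixations(samples, min_duration_ms=100, max_dispersion_px=30):
    """Group time-ordered gaze samples into fixations.

    samples: list of (timestamp_ms, x, y) gaze points.
    Returns a list of (center_x, center_y, duration_ms) fixations.
    Note: the dispersion radius is an illustrative choice, not a value
    from the paper; only the 100 ms minimum duration is.
    """
    fixations = []
    i = 0
    while i < len(samples):
        j = i
        xs, ys = [samples[i][1]], [samples[i][2]]
        # Grow the window while the bounding-box dispersion stays small.
        while j + 1 < len(samples):
            nx, ny = samples[j + 1][1], samples[j + 1][2]
            dispersion = (max(xs + [nx]) - min(xs + [nx]) +
                          max(ys + [ny]) - min(ys + [ny]))
            if dispersion <= max_dispersion_px:
                xs.append(nx)
                ys.append(ny)
                j += 1
            else:
                break
        duration = samples[j][0] - samples[i][0]
        if duration >= min_duration_ms:
            # Report the centroid and duration of the fixated region.
            fixations.append((sum(xs) / len(xs), sum(ys) / len(ys), duration))
            i = j + 1
        else:
            i += 1
    return fixations
```

Fixation centroids and durations obtained this way correspond to the circle positions and sizes plotted in Figs. 1(d)-(f), and serve as the input to the fixation-clustering step described next.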