Automated Localization of Affective Objects and Actions in Images Via Caption Text-cum-Eye Gaze Analysis

Subramanian Ramanathan, Harish Katti, Raymond Huang Z.W., Tat-Seng Chua, Mohan Kankanhalli
School of Computing, Department of Psychology
National University of Singapore
{raman,harishk,chuats,mohan}@comp.nus.edu.sg, raymondhuang@nus.edu.sg

ABSTRACT
We propose a novel framework to localize and label affective objects and actions in images through a combination of text, visual and gaze-based analysis. Human gaze provides useful cues for inferring the locations and interactions of affective objects. While the concepts (labels) associated with an image can be determined from its caption, we demonstrate localization of these concepts by learning a statistical affect model for world concepts. The affect model is derived from non-invasively acquired fixation patterns on labeled images, and guides localization of affective objects (faces, reptiles) and actions (look, read) from fixations in unlabeled images. Experimental results obtained on a database of 500 images confirm the effectiveness and promise of the proposed approach.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Multimedia Application
General Terms: Human Factors, Algorithms.
Keywords: Automated localization and labeling, caption text-cum-eye gaze analysis, affect model for world concepts, statistical model.

1. INTRODUCTION
Image understanding remains an unsolved problem, despite the many advances in computer vision. Describing natural images involves automated segmentation and recognition of the various scene objects appearing at multiple scales and orientations, a challenge that has inspired LabelMe [11]. The difficulty of determining image objects (concepts) from visual content alone has led image retrieval algorithms [2] to rely on associated keywords and captions for image search.
Noise associated with text-based image retrieval led to the development of Supervised Multiclass Labeling (SML) [1], which segments and labels unknown images by applying learned knowledge to an extracted 'bag of features'. However, the algorithm requires extensive training and fails to address the semantic gap. An urn model for object recall is used in [12] to establish the importance of certain scene objects, even in simple scenes. Also, observations made from eye-gaze statistics in [5] suggest that humans are attentive to interesting objects in semantically rich photographs.

Eye-gaze measurements have been employed for modeling user attention in a number of applications, including visual search for Human-Computer Interaction (HCI) [7] and open signed video analysis [3]. [9] employs low-level image features (contrast, intensity, etc.) to compute a saliency map that predicts human gaze. However, as noted in [5], objects drive attention in semantically rich images, while low-level saliency contributes only indirectly.

This paper is perhaps most similar to [10], where caption text and image segments are combined to localize the subject of a natural image. Instead, we focus on localizing affective (attention-grabbing, emotion-evoking) concepts in images. Contrary to the notion that human subjectivity influences the choice of interesting scene objects, we observe that affective concepts are consistently fixated upon by a majority of subjects.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM'09, October 19–24, 2009, Beijing, China.
Copyright 2009 ACM 978-1-60558-608-3/09/10 ...$10.00.
These concepts may correspond to individual objects or to interactions between two objects (actions). An affect model for world concepts is derived from fixation patterns on labeled images. The affect model encodes a world ontology as a tree whose vertex weights denote concept affectiveness, and it helps localize the most affective concepts corresponding to the caption of an unlabeled image. Since eye gaze is a strong indicator of visual attention, the proposed affect model can easily be extended to include interesting objects in semantically rich images.

Fig. 1 demonstrates automatic labeling of generic faces using the proposed approach. Labeled images (Figs. 1(a),(b)) are used for learning affective image concepts. Subject fixation patterns for these images, where a fixation is defined as attention around a point for a minimum time period (100 msec in our experiments), are shown in Figs. 1(d),(e). Distinct colors represent fixation patterns of different subjects, numbers denote the sequence of fixations, and circle sizes denote the fixation duration around a point. While the training images include labels like body, grass, etc., we observe a majority of fixations on the face, implying that faces are affective. Moreover, most fixations within the face are observed around the eyes, nose and mouth. Fig. 1(c) is an unlabeled image with known fixation patterns (Fig. 1(f)), whose caption reads 'A cute cat face'.

The hierarchy of affective concepts for Fig. 1(c) is determined through the affect model as face → {nose + mouth, eyes}. Using JSEG segmentation [4] as a guide, recursive fixation clustering is employed for affective concept localization.
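A fixation, as defined above, can be recovered from raw eye-tracker samples by grouping consecutive gaze points that stay within a small spatial window for long enough. The sketch below uses a simple dispersion-threshold test; only the 100 msec minimum duration comes from the paper, while the sample format, function name and dispersion radius are illustrative assumptions, not the authors' pipeline.

```python
def detect_fixations(samples, min_duration_ms=100, max_dispersion_px=30):
    """Group time-ordered gaze samples into fixations.

    samples: list of (timestamp_ms, x, y) gaze points.
    Returns a list of (center_x, center_y, duration_ms) fixations.
    Note: the dispersion radius is an illustrative choice, not a value
    from the paper; only the 100 ms minimum duration is.
    """
    fixations = []
    i = 0
    while i < len(samples):
        j = i
        xs, ys = [samples[i][1]], [samples[i][2]]
        # Grow the window while the bounding-box dispersion stays small.
        while j + 1 < len(samples):
            nx, ny = samples[j + 1][1], samples[j + 1][2]
            dispersion = (max(xs + [nx]) - min(xs + [nx]) +
                          max(ys + [ny]) - min(ys + [ny]))
            if dispersion <= max_dispersion_px:
                xs.append(nx)
                ys.append(ny)
                j += 1
            else:
                break
        duration = samples[j][0] - samples[i][0]
        if duration >= min_duration_ms:
            # Report the centroid and duration of the fixated region.
            fixations.append((sum(xs) / len(xs), sum(ys) / len(ys), duration))
            i = j + 1
        else:
            i += 1
    return fixations
```

Fixation centroids and durations obtained this way correspond to the circle positions and sizes plotted in Figs. 1(d)-(f), and serve as the input to the fixation-clustering step described next.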