A Selective Weighted Late Fusion for Visual Concept Recognition

Ningning Liu, Emmanuel Dellandrea, Chao Zhu, Charles-Edmond Bichot, and Liming Chen

Université de Lyon, CNRS, École Centrale de Lyon, LIRIS, UMR5205, F-69622, France
{ningning.liu,emmanuel.dellandrea,chao.zhu,charles-edmond.bichot,liming.chen}@ec-lyon.fr

Abstract. We propose in this paper a novel multimodal approach to automatically predict the visual concepts of images through an effective fusion of visual and textual features. It relies on a Selective Weighted Late Fusion (SWLF) scheme which, by optimizing an overall Mean interpolated Average Precision (MiAP), learns to automatically select and weight the best experts for each visual concept to be recognized. Experiments were conducted on the MIR Flickr image collection within the ImageCLEF 2011 Photo Annotation challenge. The results highlight the effectiveness of SWLF: it achieved a MiAP of 43.69% for the detection of the 99 visual concepts, ranking 2nd out of the 79 submitted runs, while our new variant of SWLF reaches a MiAP of 43.93%.

Keywords: Visual concept recognition, multimodality, feature fusion.

1 Introduction

Machine-based recognition of visual concepts aims at automatically recognizing high-level visual semantic concepts (HLSC), including scenes (e.g., indoor, outdoor, landscape, etc.), objects (car, animal, person, etc.), events (travel, work, etc.), or even emotions (melancholic, happy, etc.). It proves to be extremely challenging because of large intra-class variations and inter-class similarities, clutter, occlusion and pose changes. The past decade has witnessed tremendous efforts from the research communities, as testified by the multiple challenges in the field, e.g., PASCAL VOC [1], TRECVID [2] and ImageCLEF [3].
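The selection-and-weighting idea behind SWLF can be sketched as follows. This is only an illustrative sketch, not the authors' exact procedure: the function name `swlf_predict`, the AP-proportional weights, and the fixed `n_best` parameter are assumptions introduced for illustration; the paper's scheme optimizes MiAP over all concepts to choose the experts.

```python
import numpy as np

def swlf_predict(expert_scores, expert_val_ap, n_best=3):
    """Selective weighted late fusion for a single concept (illustrative).

    expert_scores : dict mapping expert name -> array of classifier scores,
                    one score per test image.
    expert_val_ap : dict mapping expert name -> average precision of that
                    expert measured on a held-out validation set.
    n_best        : number of experts to retain (hypothetical parameter).
    """
    # "Selective" step: keep only the n_best experts by validation AP.
    selected = sorted(expert_val_ap, key=expert_val_ap.get, reverse=True)[:n_best]
    # "Weighted" step: weight each retained expert by its validation AP,
    # normalized so the weights sum to one.
    total = sum(expert_val_ap[e] for e in selected)
    weights = {e: expert_val_ap[e] / total for e in selected}
    # Late fusion: the fused score is the weighted sum of expert scores.
    return sum(weights[e] * np.asarray(expert_scores[e], dtype=float)
               for e in selected)
```

Because selection and weighting are done per concept, an expert that is strong for one concept (e.g., textual tags for "party") can be dropped or down-weighted for another (e.g., "sunset"), which is the key difference from a single global fusion rule.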
Most approaches to visual concept recognition (VCR) have so far focused on appropriate visual content description, and have featured a dominant bag-of-visual-words (BoVW) representation along with local SIFT descriptors. Meanwhile, a growing body of work has uncovered the wealth of semantic meaning conveyed by the abundant textual captions associated with images [4]. Therefore, multimodal approaches have been proposed for VCR that make joint use of user textual tags and visual descriptions to bridge the gap between HLSC and low-level visual features. The work presented in this paper is in that line and targets an effective feature fusion scheme for VCR.

A. Fusiello et al. (Eds.): ECCV 2012 Ws/Demos, Part III, LNCS 7585, pp. 426–435, 2012.
© Springer-Verlag Berlin Heidelberg 2012