IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 4, AUGUST 2012

Harvesting Social Images for Bi-Concept Search

Xirong Li, Cees G. M. Snoek, Senior Member, IEEE, Marcel Worring, Member, IEEE, and Arnold W. M. Smeulders, Member, IEEE

Abstract—Searching for the co-occurrence of two visual concepts in unlabeled images is an important step towards answering complex user queries. Traditional visual search methods use combinations of the confidence scores of individual concept detectors to tackle such queries. In this paper we introduce the notion of bi-concepts, a new concept-based retrieval method that is directly learned from social-tagged images. As the number of potential bi-concepts is gigantic, manually collecting training examples is infeasible. Instead, we propose a multimedia framework to collect de-noised positive as well as informative negative training examples from the social web, to learn bi-concept detectors from these examples, and to apply them in a search engine for retrieving bi-concepts in unlabeled images. We study the behavior of our bi-concept search engine using 1.2 M social-tagged images as a data source. Our experiments indicate that harvesting examples for bi-concepts differs from traditional single-concept methods, yet the examples can be collected with high accuracy using a multi-modal approach. We find that directly learning bi-concepts is better than oracle linear fusion of single-concept detectors, with a relative improvement of 100%. This study reveals the potential of learning high-order semantics from social images, for free, suggesting promising new lines of research.

Index Terms—Bi-concept, semantic index, visual search.

I. INTRODUCTION

SEARCHING pictures on smart phones, PCs, and the Internet for specific visual concepts, such as objects and scenes, is of great importance for users with all sorts of information needs. As the number of images is growing so rapidly, full manual annotation is infeasible.
Therefore, automatically determining the occurrence of visual concepts in the visual content is crucial. Compared to low-level visual features such as color and local descriptors used in traditional content-based image retrieval, the concepts provide direct access to the semantics of the visual content. Thanks to continuous progress in generic visual concept detection [1]–[4], followed by novel exploitation of the individual detection results [5]–[8], an effective approach to unlabeled image search is dawning.

Manuscript received June 28, 2011; revised November 26, 2011; accepted March 14, 2012. Date of publication April 03, 2012; date of current version July 13, 2012. This work was supported in part by the Dutch national program COMMIT and in part by the STW SEARCHER project. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Samson Cheung.

X. Li is with the MOE Key Laboratory of Data Engineering and Knowledge Engineering, School of Information, Renmin University of China, Beijing, China (e-mail: xirong.li@gmail.com).

C. G. M. Snoek, M. Worring, and A. W. M. Smeulders are with the Intelligent Systems Lab Amsterdam, University of Amsterdam, 1098 XH Amsterdam, The Netherlands.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2012.2191943

Fig. 1. Searching for two visual concepts co-occurring in unlabeled images. A (green) tick indicates a positive result. Given two single-concept detectors with reasonable accuracy, a combination using their individual confidence scores yields a bad retrieval result (c). We propose to answer the complex query using a bi-concept detector optimized in terms of mutual training examples (d). (a) Searching for “car” by a “car” detector. (b) Searching for “horse” by a “horse” detector. (c) Searching for “car horse” by combining the results of (a) and (b).
(d) Searching for “car horse” using the proposed bi-concept search engine.

In reality, however, a user’s query is often more complex than a single concept can represent [9]. For instance, consider the query: “an image showing a horse next to a car”. To answer this query, one might expect to employ a “car” detector and a “horse” detector and combine their predictions, which is indeed the mainstream approach in the literature [6]–[8], [10]–[12]. But is this approach effective? We observe that single-concept detectors are trained on typical examples of the corresponding concept, e.g., cars on a street for the “car” detector and horses on grass for the “horse” detector. We hypothesize that images with horses and cars co-occurring also have a characteristic visual appearance, while the individual concepts might not be present in their common form. Hence, combining two reasonably accurate single-concept detectors is mostly ineffective for finding images with both concepts visible, as illustrated in Fig. 1. Ideally, we treat the combination of the concepts as a new concept, which we term a bi-concept. To be precise, we define

1520-9210/$31.00 © 2012 IEEE
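The failure mode described above can be made concrete with a small sketch. The detector scores, the fusion weight, and the image identifiers below are hypothetical illustrations, not the paper's actual data or implementation; the point is only that an image where both concepts co-occur in atypical form can score moderately under both single-concept detectors and thus lose the fused ranking, while a detector trained directly on co-occurrence examples can still rank it first.

```python
# Minimal sketch (hypothetical scores, not the paper's implementation)
# contrasting linear fusion of single-concept confidence scores with a
# dedicated bi-concept detector for the query "car horse".

def fuse_scores(car_score, horse_score, weight=0.5):
    """Linear fusion of two single-concept confidence scores:
    the mainstream baseline discussed above."""
    return weight * car_score + (1.0 - weight) * horse_score

def rank(images, scorer):
    """Return image ids sorted by descending score."""
    return [im["id"] for im in sorted(images, key=scorer, reverse=True)]

# Hypothetical confidence scores. img3 is the only image where a car and
# a horse actually co-occur, but neither appears in its typical form, so
# both single-concept detectors score it only moderately.
images = [
    {"id": "img1", "car": 0.90, "horse": 0.05, "bi": 0.10},  # car on a street
    {"id": "img2", "car": 0.05, "horse": 0.95, "bi": 0.15},  # horse on grass
    {"id": "img3", "car": 0.45, "horse": 0.40, "bi": 0.80},  # car and horse together
]

fused_ranking = rank(images, lambda im: fuse_scores(im["car"], im["horse"]))
bi_ranking = rank(images, lambda im: im["bi"])

print(fused_ranking)  # fusion puts a single-concept image on top
print(bi_ranking)     # the bi-concept detector ranks img3 first
```

In this sketch, fusion ranks the horse-only image above the genuine co-occurrence image, mirroring the outcome illustrated in Fig. 1(c) versus Fig. 1(d).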