IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 4, AUGUST 2012 1091
Harvesting Social Images for Bi-Concept Search
Xirong Li, Cees G. M. Snoek, Senior Member, IEEE, Marcel Worring, Member, IEEE, and
Arnold W. M. Smeulders, Member, IEEE
Abstract—Searching for the co-occurrence of two visual concepts in unlabeled images is an important step towards answering complex user queries. Traditional visual search methods use combinations of the confidence scores of individual concept detectors
to tackle such queries. In this paper we introduce the notion of
bi-concepts, a new concept-based retrieval method that is directly
learned from social-tagged images. As the number of potential
bi-concepts is gigantic, manually collecting training examples
is infeasible. Instead, we propose a multimedia framework to
collect de-noised positive as well as informative negative training
examples from the social web, to learn bi-concept detectors from
these examples, and to apply them in a search engine for retrieving
bi-concepts in unlabeled images. We study the behavior of our
bi-concept search engine using 1.2 M social-tagged images as a
data source. Our experiments indicate that harvesting examples
for bi-concepts differs from traditional single-concept methods,
yet the examples can be collected with high accuracy using a
multi-modal approach. We find that directly learning bi-concepts
is better than oracle linear fusion of single-concept detectors, with
a relative improvement of 100%. This study reveals the potential
of learning high-order semantics from social images, for free,
suggesting promising new lines of research.
Index Terms—Bi-concept, semantic index, visual search.
I. INTRODUCTION
Searching pictures on smart phones, PCs, and the Internet for specific visual concepts, such as objects and scenes, is of great importance for users with all sorts of information needs. As the number of images is growing so rapidly, full manual annotation is infeasible. Therefore, automatically
determining the occurrence of visual concepts in the visual
content is crucial. Compared to low-level visual features such
as color and local descriptors used in traditional content-based
image retrieval, the concepts provide direct access to the semantics of the visual content. Thanks to continuous progress
in generic visual concept detection [1]–[4], followed by novel
exploitation of the individual detection results [5]–[8], an
effective approach to unlabeled image search is dawning.
Manuscript received June 28, 2011; revised November 26, 2011; accepted
March 14, 2012. Date of publication April 03, 2012; date of current version
July 13, 2012. This work was supported in part by the Dutch national program
COMMIT and in part by the STW SEARCHER project. The associate editor
coordinating the review of this manuscript and approving it for publication was
Dr. Samson Cheung.
X. Li is with the MOE Key Laboratory of Data Engineering and Knowledge Engineering, School of Information, Renmin University of China, Beijing,
China (e-mail: xirong.li@gmail.com).
C. G. M. Snoek, M. Worring, and A. W. M. Smeulders are with the Intelligent
Systems Lab Amsterdam, University of Amsterdam, 1098 XH Amsterdam, The
Netherlands.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2012.2191943
Fig. 1. Searching for two visual concepts co-occurring in unlabeled images. A
(green) tick indicates a positive result. Given two single concept detectors with
reasonable accuracy, a combination using their individual confidence scores
yields a bad retrieval result (c). We propose to answer the complex query
using a bi-concept detector optimized in terms of mutual training examples
(d). (a) Searching for “car” by a “car” detector. (b) Searching for “horse” by
a “horse” detector. (c) Searching for “car horse” by combining the results
of (a) and (b). (d) Searching for “car horse” using the proposed bi-concept
search engine.
In reality, however, a user’s query is often more complex
than a single concept can represent [9]. For instance, consider the query: “an image showing a horse next to a car”. To answer this query, one might expect to employ a “car” detector
and a “horse” detector, and combine their predictions, which
is indeed the mainstream approach in the literature [6]–[8],
[10]–[12]. But is this approach effective? We observe that the
single concept detectors are trained on typical examples of
the corresponding concept, e.g., cars on a street for the “car”
detector, and horses on grass for the “horse” detector. We
hypothesize that images with horses and cars co-occurring also
have a characteristic visual appearance, while the individual
concepts might not be present in their common form. Hence,
combining two reasonably accurate single concept detectors
is mostly ineffective for finding images with both concepts
visible, as illustrated in Fig. 1.
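The mainstream combination strategy can be sketched as a simple linear fusion of per-image confidence scores. The sketch below uses toy scores and image identifiers, not real detector output, and `fuse_scores` is a hypothetical helper, not the method evaluated in this paper.

```python
# Toy sketch of the mainstream approach: rank unlabeled images for the
# query "car horse" by linearly fusing the confidence scores of two
# independently trained single-concept detectors.

def fuse_scores(car_scores, horse_scores, weight=0.5):
    """Linear fusion: weight * car + (1 - weight) * horse, per image."""
    return {img: weight * car_scores[img] + (1 - weight) * horse_scores[img]
            for img in car_scores}

# Illustrative confidence scores from two single-concept detectors.
car_scores = {"img1": 0.9, "img2": 0.2, "img3": 0.6}
horse_scores = {"img1": 0.1, "img2": 0.8, "img3": 0.7}

fused = fuse_scores(car_scores, horse_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Note that the fused ranking rewards images with moderate scores from both detectors, yet neither detector was trained on examples where the two concepts co-occur; this is precisely the limitation that motivates learning a dedicated bi-concept detector instead.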
Ideally, we treat the combination of the concepts as a new concept, which we term a bi-concept. To be precise, we define