ORIGINAL RESEARCH

A three-level architecture for bridging the image semantic gap

Mohammed Belkhatir

Received: 5 December 2008 / Accepted: 10 October 2010 / Published online: 16 November 2010
© Springer-Verlag 2010

Abstract  Image retrieval systems face the problem of dealing with the different ways of apprehending the content of images, and in particular the difficulty of characterizing the visual semantics. To address this issue, we examine the use of three abstract levels of representation, namely Signal, Object and Semantic. At the Signal Level, we propose a framework mapping the extracted low-level features to symbolic signal descriptors. The Object Level features a statistical model considering the joint distribution of object concepts (such as mountains, sky) and the symbolic signal descriptors. At the Semantic Level, signal and object characterizations are coupled within a logic-based framework. The latter is instantiated by a knowledge representation formalism allowing the definition of an expressive query language consisting of several Boolean and quantification operators. Our architecture therefore makes it possible to process topic-based queries. Experimentally, we evaluate our theoretical proposition on a corpus of real-world photographs and the TRECVid corpus.

Keywords  Multimedia processing · Semantic gap · Image indexing and retrieval · Experimental evaluation

1 Introduction

Image indexing and retrieval systems, which have been the subject of extensive research since the 1990s, can be categorized with respect to their index and query abstraction level. We mainly identify three levels.

The first level, namely the Signal Level, represents numerical abstractions of image regions. Such abstractions characterize the colors and textures of visible elements in images.
The general approach consists of computing structures representing the image distribution, such as color histograms and texture features, and using this data to partition the image, thus reducing the search space during the image retrieval operation. These methods hold the advantage of being fully automatic and are therefore able to process queries quickly. However, aspects related to human perception, which are of prime importance in image retrieval, are not taken into account. In the remainder of the paper, this level is considered only as far as the automatic extraction of low-level signal features is concerned.

In order to address the inability of signal-based systems to characterize the image semantics (also called the semantic gap [1]), the second level of representation (namely the Image Object Level) supports the notion of labeling the image visual entities. This level intends to bridge the gap between the signal aspects (first level) and the symbols representing the content of images. For this, two classes of automatic semantic extraction architectures have been proposed in the literature. The first, which aims at categorizing images into broad semantic classes, operates at the global image level. In [2], several experimental studies led to the specification of 20 semantic categories, or image scenes, describing the image content at a global level (such as groups of people, cityscapes and landscapes). Each of these categories is then linked to several low-level features gathered within the complete feature set. The most recent automatic annotation models linking annotation words to visual features are based on statistical models [38]. Blei and Jordan [3] extend the latent Dirichlet allocation (LDA) model

Communicated by Wei-Ying Ma.

M. Belkhatir
CNRS, University of Lyon, Lyon, France
e-mail: mohammed.belkhatir@iut.univ-lyon1.fr

Multimedia Systems (2011) 17:135–148
DOI 10.1007/s00530-010-0207-8
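The histogram-based, signal-level approach outlined at the start of Sect. 1 can be illustrated with a minimal sketch: quantize each RGB channel into a few bins, normalize the counts into a distribution, and compare two images by histogram intersection. The helper names (`color_histogram`, `histogram_intersection`) and the toy pixel lists are hypothetical, not taken from the paper; a real system would read pixel data with an image library and use a finer quantization.

```python
# Sketch of signal-level retrieval via quantized color histograms.
# Hypothetical helpers for illustration only; real systems extract
# pixels with an image library and use finer color quantization.

def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` intervals and count pixels."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = sum(hist) or 1.0
    return [h / total for h in hist]  # normalize to a distribution

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy "images" as flat lists of RGB tuples: mostly blue vs. mostly red.
blue_img = [(10, 20, 200)] * 90 + [(200, 30, 30)] * 10
red_img = [(200, 30, 30)] * 85 + [(10, 20, 200)] * 15

sim = histogram_intersection(color_histogram(blue_img),
                             color_histogram(red_img))
print(round(sim, 2))
```

Ranking a query image against an indexed corpus by this similarity score is exactly the search-space reduction described above: fully automatic and fast, but blind to human perceptual notions such as object identity.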