Object Category Detection Using Audio-Visual Cues

Jie Luo (1,2), Barbara Caputo (1,2), Alon Zweig (3), Jörg-Hendrik Bach (4), and Jörn Anemüller (4)

(1) IDIAP Research Institute, Centre du Parc, 1920 Martigny, Switzerland
(2) Swiss Federal Institute of Technology in Lausanne (EPFL), 1015 Lausanne, Switzerland
(3) Hebrew University of Jerusalem, 91904 Jerusalem, Israel
(4) Carl von Ossietzky University Oldenburg, 26111 Oldenburg, Germany

{jluo,bcaputo}@idiap.ch, zweiga@cs.huji.ac.il, {joerg-hendrik.bach,joern.anemueller}@uni-oldenburg.de

Abstract. Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities in order to achieve robust categorization. In this paper we propose a multimodal approach to object category detection, using audio and visual information. The auditory channel is modeled on biologically motivated spectral features via a discriminative classifier. The visual channel is modeled by a state-of-the-art part-based model. Multimodality is achieved using two fusion schemes, one high-level and the other low-level. Experiments on six different object categories, under increasingly difficult conditions, show the strengths and weaknesses of the two approaches, and clearly underline the open challenges for multimodal category detection.

Keywords: Object Categorization, Multimodal Recognition, Audio-visual Fusion.

1 Introduction

The capability to categorize is a fundamental component of cognitive systems. It can be considered the building block of the capability to think itself [1]. Its importance for artificial systems is widely recognized, as witnessed by a vast literature (see [2,3] and references therein). Traditionally, categorization has been studied from a unimodal perspective (with some notable exceptions, see [4] and references therein).
For instance, during the last five years the computer vision community has attacked the object categorization problem by (a) developing algorithms for the detection of specific categories such as cars, cows, pedestrians and many others [2,3]; and (b) collecting several benchmark databases and promoting benchmark evaluations for assessing progress in the field. The paradigm emerging from these activities is the so-called 'part-based approach', where visual categories are modeled on the basis of local information. This information is then used to build a learning-based algorithm for classification. Both probabilistic and discriminative approaches have been used so far, with promising results.

Still, an algorithm aiming to work on an autonomous system cannot ignore the intrinsically multimodal nature of categories, or the multisensory capabilities of the system.

A. Gasteratos, M. Vincze, and J.K. Tsotsos (Eds.): ICVS 2008, LNCS 5008, pp. 539–548, 2008. © Springer-Verlag Berlin Heidelberg 2008