Hybrid Pooling Fusion in the BoW Pipeline

Marc Law, Nicolas Thome, and Matthieu Cord

LIP6, UPMC - Sorbonne University, Paris, France
{Marc.Law,Nicolas.Thome,Matthieu.Cord}@lip6.fr

Abstract. In the context of object and scene recognition, state-of-the-art performances are obtained with Bag of Words (BoW) models of mid-level representations computed from densely sampled local descriptors (e.g. SIFT). Several methods to combine low-level features and to set mid-level parameters have been evaluated recently for image classification. In this paper, we further investigate the impact of the main parameters in the BoW pipeline. We show that an adequate combination of several low-level (sampling rate, multiscale) and mid-level (codebook size, normalization) parameters is decisive to reach good performances. Based on this analysis, we propose a merging scheme exploiting the specificities of edge-based descriptors. Low- and high-contrast regions are pooled separately and combined to provide a powerful representation of images. Successful experiments are reported on the Caltech-101 and Scene-15 datasets.

1 Introduction and Related Work

Image classification is one of the most challenging problems in computer vision. Indeed, predicting complex semantic categories, such as scenes or objects, from the pixel level is still a very hard task. Two main breakthroughs have been achieved in the last decade toward this goal. The first one is the design of discriminative low-level local features, such as SIFT [1]. The second one is the emergence of mid-level representations inspired by the text retrieval community, based on the Bag of Words (BoW) model [2].

In the BoW model, converting the set of local descriptors into the final image representation is performed by a succession of two steps: coding and pooling. In the original BoW model, coding consists in hard-assigning each local descriptor to the closest visual word, while pooling averages the local descriptor projections.
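The original coding/pooling scheme described above can be sketched as follows. This is a minimal illustration of hard-assignment coding with average pooling, not the authors' implementation; all function and variable names here are our own:

```python
import numpy as np

def bow_hard_average(descriptors, codebook):
    """Hard-assignment coding with average pooling.

    descriptors: (n, d) array of local descriptors (e.g. SIFT).
    codebook:    (k, d) array of visual words.
    Returns a (k,) normalized histogram of visual-word frequencies.
    """
    # Squared Euclidean distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Coding: each descriptor is hard-assigned to its closest visual word.
    assignments = d2.argmin(axis=1)
    # Pooling: average the (one-hot) codes, i.e. a normalized count histogram.
    hist = np.bincount(assignments, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage with random data in place of real SIFT descriptors and a learned codebook.
rng = np.random.default_rng(0)
h = bow_hard_average(rng.normal(size=(100, 8)), rng.normal(size=(16, 8)))
```

The resulting `h` is the classical BoW signature: one bin per visual word, normalized so the bins sum to one.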
One important limitation of the visual BoW model is the lack of spatial information. The most popular extension to overcome this problem is the Spatial Pyramid Scheme [3]. In addition, many efforts have recently been devoted to improving coding and pooling [4]. To attenuate the quantization loss, soft assignment attempts to smoothly distribute features to the codewords [5, 6]. In sparse coding approaches [7–9], there is an explicit minimization of the feature reconstruction error, along with a regularization prior that encourages sparse solutions. Different pooling strategies have also been studied. Max pooling is a promising alternative to sum pooling [6–10], especially when linear classifiers are used. Therefore, the combination of sparse coding, spatial pyramids and max pooling is often regarded as the strategy leading to state-of-the-art performances.

A. Fusiello et al. (Eds.): ECCV 2012 Ws/Demos, Part III, LNCS 7585, pp. 355–364, 2012.
© Springer-Verlag Berlin Heidelberg 2012
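To make the coding/pooling distinction concrete, one common formulation of soft-assignment coding followed by either sum or max pooling can be sketched as below. This is an illustrative sketch under our own assumptions (a Gaussian kernel on distances, hypothetical names), not a specific method from the cited works:

```python
import numpy as np

def soft_assign(descriptors, codebook, beta=0.1):
    """Soft-assignment coding: each descriptor is smoothly distributed
    over all codewords with weights exp(-beta * squared_distance),
    normalized to sum to one per descriptor. Returns an (n, k) code matrix."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-beta * d2)
    return w / w.sum(axis=1, keepdims=True)

def pool(codes, mode="max"):
    """Pool the (n, k) code matrix into a single (k,) image signature.
    Sum pooling accumulates activations; max pooling keeps, for each
    codeword, its strongest response over the image."""
    return codes.max(axis=0) if mode == "max" else codes.sum(axis=0)

# Toy usage: random descriptors and codebook stand in for real data.
rng = np.random.default_rng(0)
codes = soft_assign(rng.normal(size=(50, 8)), rng.normal(size=(10, 8)))
sig_max = pool(codes, "max")
sig_sum = pool(codes, "sum")
```

With a spatial pyramid, the same pooling would simply be applied per spatial cell and the resulting signatures concatenated.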