Applying Feature Selection Techniques for Visual Dictionary Creation in Object Classification
I.M. Creusen 1, R.G.J. Wijnhoven 1,2, and P.H.N. de With 1,3
1 Video Coding and Architecture group, Eindhoven University of Technology, The Netherlands
2 ViNotion B.V., Eindhoven, The Netherlands
3 Cyclomedia Technology B.V., Waardenburg, The Netherlands
Abstract— This paper introduces improved methods for visual dictionary creation in an object classification system. In the literature, the visual dictionary is often created from a large candidate set of features by random selection or by a clustering algorithm. We apply techniques from the feature selection literature to create a better-optimized visual dictionary and contribute a novel feature selection algorithm. As a second step, feature extraction techniques for creating the candidate set are investigated. Subsequently, the size of the candidate set is varied. We found that feature selection techniques give a clear improvement of 2-5% in classification rate at no additional computational cost in normal system operation. The proposed algorithm, called extremal optimization, outperforms state-of-the-art algorithms. The paper also reports results on candidate set creation using interest point operators. As a bonus, the evaluated feature selection techniques are applicable to any problem that uses a dictionary of features, as is typical in the object recognition domain.
Keywords: object recognition, feature evaluation and selection
1. Introduction
With the ever-increasing number of installed cameras, video surveillance personnel can be effectively assisted by extracting useful information from each video stream. State-of-the-art camera systems for video surveillance detect and track key objects in the monitored scene. Towards full scene understanding, recognition of these tracked objects is key.
Given a set of object classes and an image containing one object, the task of object categorization is to determine the correct class label of the depicted object. The operation of object categorization systems is divided into two phases: training and testing. During training, the system learns from the training set, which consists of a number of example images for each object class. The performance of the algorithm is measured as the percentage of correctly labeled objects from the test set, averaged over all object classes. This paper concentrates on feature selection in object categorization and shows that this concept contributes significantly to an improved classification score.
Classification of objects within images has been studied in earlier work. Early work by Agarwal et al. [1] uses a visual dictionary for car detection. The bag-of-words model for object classification was pioneered by Csurka et al. [2] and has received much attention [3], [4]. These models compare small object parts of the input image to a set of known object parts, called the visual dictionary. A common element of this work is the method for constructing the dictionary, which is typically done by applying a clustering algorithm to features extracted from the training set. A biologically plausible object recognition framework (HMAX) was introduced by Riesenhuber and Poggio [5] and recently optimized by Serre et al. [6]. Moreno et al. [7] have shown that HMAX performs slightly better in a categorization task than SIFT [8]. An interesting aspect of the HMAX system is that the visual dictionary is created by a different technique: features are extracted from random locations in images of natural scenery. Neither the random nor the clustering technique for constructing the dictionary optimizes the dictionary specifically for the classification task.
We have used the HMAX object classification system proposed by Serre et al.
[6] as a starting point, which was explored by Wijnhoven and De With in [9]. The input image is filtered using Gabor filters at different scales and orientations, and the result is subsampled using a local MAX-operator. For each dictionary feature, the best match in the input image is stored in the feature vector. The classifier uses this vector to learn and determine the true object class.
The computational complexity of labeling an unknown object depends linearly on the number of visual dictionary words. For an embedded camera implementation, the available computation power is strictly limited. To reduce this computational cost, we have optimized this system by creating a more discriminative dictionary with fewer visual words by applying feature selection techniques. Starting with a large candidate set of features, which are randomly extracted from images in the training set, the selection algorithms select the most distinctive features. For categorization, we compare the results of three common feature selection techniques. The investigated techniques outperform the random and clustering methods. Moreover, we adopt a new algorithm from the optimization literature and exploit it successfully for dictionary creation.
The visual dictionary is created during the training phase. This is an offline process without real-time constraints. Therefore, the