Using Dominant Sets for k-NN Prototype Selection

Sebastiano Vascon 1, Marco Cristani 1, Marcello Pelillo 2, and Vittorio Murino 1

1 Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy
{sebastiano.vascon,marco.cristani,vittorio.murino}@iit.it
2 DAIS, University Ca' Foscari of Venice, Via Torino 155, 30172 Venezia Mestre, Italy
pelillo@dsi.unive.it

Abstract. k-Nearest Neighbors is surely one of the most important and widely adopted non-parametric classification methods in pattern recognition. It has evolved in several respects over the last 50 years, and one of the best-known variants consists in the usage of prototypes: a prototype distills a group of similar training points, drastically diminishing the number of comparisons needed for classification; prototypes are typically employed when the cardinality of the training data is high. In this paper, using the dominant set clustering framework, we propose four novel strategies for prototype generation, which produce representative prototypes that mirror the underlying class structure in an expressive and effective way. Our strategy boosts k-NN classification performance; considering heterogeneous metrics and analyzing 15 diverse datasets, we rank among the best 6 prototype-based k-NN approaches, with a computational cost markedly lower than that of all the competitors. In addition, we show that our proposal beats linear SVM in a pedestrian detection scenario.

Keywords: K-nearest neighbors, Prototype selection, Classification, Dominant set, Data reduction.

1 Introduction

The k-Nearest Neighbors (kNN) method [3] is one of the fundamental strategies of non-parametric supervised classification, with a large spectrum of variations proposed in the last 50 years.
It relies on a simple principle: a test sample is assigned to one of the N available classes by taking a majority vote over the labels of its k nearest training neighbors in the feature space. k-NN, in its original form, suffers from two major drawbacks. The first is poor efficiency, especially when the training dataset becomes large [10]: this is due to the fact that all the distances between the test element and the training set must be computed in order to find the k closest elements. The second problem is sensitivity to outliers [6]: during classification, all the training data are treated in the same way, ignoring the possibility of taking into account class outliers or

A. Petrosino (Ed.): ICIAP 2013, Part II, LNCS 8157, pp. 131-140, 2013. (c) Springer-Verlag Berlin Heidelberg 2013
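The basic k-NN rule described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the function name, the toy data, and the choice of Euclidean distance are our own assumptions.

```python
import numpy as np

def knn_classify(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points.

    Illustrative sketch of the basic k-NN rule; all names here are
    hypothetical, not taken from the paper.
    """
    # Euclidean distance from the test sample to every training point;
    # this exhaustive scan is exactly the efficiency bottleneck that
    # prototype selection aims to reduce.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training samples
    nn = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two well-separated 2-D classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_classify(X, y, np.array([0.05, 0.1]), k=3))
```

Note that every call scans the full training set, which is what motivates replacing the training points with a much smaller set of prototypes.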