Analysis of new techniques to obtain quality training sets

J.S. Sánchez a,*, R. Barandela b, A.I. Marqués a, R. Alejo b, J. Badenas a

a Universitat Jaume I, Av. Vicent Sos Baynat s/n, 12006 Castellón, Spain
b Instituto Tecnológico de Toluca, Av. Tecnológico s/n, 52140 Metepec, Mexico

Abstract

This paper presents new algorithms to identify and eliminate mislabelled, noisy and atypical training samples for supervised learning and, more specifically, for nearest neighbour classification. The main goal of these approaches is to enhance the classification accuracy by improving the quality of the training data. Several experiments with synthetic and real data sets are carried out in order to illustrate the behaviour of the schemes proposed here and to compare their performance with that of other traditional techniques. The ability of these new algorithms to "reduce" the possible overlapping among regions of different classes is also analysed.

© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Nearest neighbour; Editing; Classification accuracy; Nearest centroid neighbourhood; Outlier; Quality training set

1. Introduction

One goal of any learning algorithm is to form a generalization from a set of labelled training samples such that the classification accuracy for new samples is maximised. The maximum accuracy achievable depends on the quality of the input data and on the appropriateness of the chosen learning algorithm for the data.

The work described in this paper concentrates on improving the quality of training data by identifying and eliminating mislabelled and atypical samples prior to applying the chosen learning scheme, thereby increasing classification accuracy. An immediate positive effect of eliminating such samples is that the possible overlapping among different classes is drastically reduced.
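To make the idea of removing suspect samples before classification concrete, the following is a minimal sketch of one traditional editing scheme, Wilson's editing (leave-one-out k-NN filtering). It is not one of the algorithms proposed in this paper. The function names `knn_label` and `wilson_edit`, the toy data, and the choice k = 3 are assumptions made purely for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_label(x, samples, labels, k):
    """Majority label among the k samples closest to x."""
    order = sorted(range(len(samples)), key=lambda i: dist(x, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def wilson_edit(samples, labels, k=3):
    """Return indices of samples whose label agrees with the k-NN vote
    of the remaining samples (leave-one-out); the others are discarded."""
    keep = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_label(x, rest, rest_labels, k) == y:
            keep.append(i)
    return keep


# Hypothetical toy training set: two well-separated clusters, with the
# sample at index 3 carrying the wrong label for its cluster.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1),
           (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.9)]
labels = ["A", "A", "A", "B", "B", "B", "B", "B"]

kept = wilson_edit(samples, labels, k=3)
edited_samples = [samples[i] for i in kept]
edited_labels = [labels[i] for i in kept]
print(kept)
```

In this toy set, the only sample discarded is the one whose label disagrees with all of its nearest neighbours, so the region around the first cluster no longer contains a sample of the other class. All flagged samples are removed together after every sample has been tested against the full original set.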
Pattern Recognition Letters 24 (2003) 1015–1022, www.elsevier.com/locate/patrec
* Corresponding author. Tel.: +34-964-728350; fax: +34-964-728435.
E-mail addresses: sanchez@uji.es (J.S. Sánchez), rbarandela@hotmail.com (R. Barandela).
0167-8655/03/$ - see front matter. PII: S0167-8655(02)00225-8

The problem of handling mislabelled, atypical and noisy training samples has been the focus of much attention in both the pattern recognition and machine learning domains (Devijver and Kittler, 1982; Brodley and Friedl, 1999; Wilson and Martinez, 2000). For example, considerable effort has been devoted to improving the classification performance of the well-known nearest neighbour (NN) rule. Accordingly, this paper addresses the problem of selecting prototypes in order to improve the classification accuracy of an NN classifier. In this context, some approaches to remove outliers from the training set (TS) are introduced here. An