Analysis of new techniques to obtain quality training sets

J.S. Sánchez a,*, R. Barandela b, A.I. Marqués a, R. Alejo b, J. Badenas a

a Universitat Jaume I, Av. Vicent Sos Baynat s/n, 12006 Castellón, Spain
b Instituto Tecnológico de Toluca, Av. Tecnológico s/n, 52140 Metepec, Mexico

Abstract

This paper presents new algorithms to identify and eliminate mislabelled, noisy and atypical training samples for supervised learning and, more specifically, for nearest neighbour classification. The main goal of these approaches is to enhance classification accuracy by improving the quality of the training data. Several experiments with synthetic and real data sets are carried out to illustrate the behaviour of the proposed schemes and to compare their performance with that of other traditional techniques. The ability of these new algorithms to "reduce" the possible overlapping among regions of different classes is also analysed. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Nearest neighbour; Editing; Classification accuracy; Nearest centroid neighbourhood; Outlier; Quality training set

1. Introduction

One goal of any learning algorithm is to form a generalization from a set of labelled training samples such that the classification accuracy for new samples is maximised. The maximum accuracy achievable depends on the quality of the input data and on the appropriateness of the chosen learning algorithm for the data.

The work described in this paper concentrates on improving the quality of the training data by identifying and eliminating mislabelled and atypical samples prior to applying the chosen learning scheme, thereby increasing classification accuracy. An immediate positive effect of eliminating such samples is that the possible overlapping among different classes is drastically reduced.
Pattern Recognition Letters 24 (2003) 1015–1022, www.elsevier.com/locate/patrec
* Corresponding author. Tel.: +34-964-728350; fax: +34-964-728435. E-mail addresses: sanchez@uji.es (J.S. Sánchez), rbarandela@hotmail.com (R. Barandela).
0167-8655/03/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-8655(02)00225-8

The problem of handling mislabelled, atypical and noisy training samples has been the focus of much attention in both the pattern recognition and machine learning domains (Devijver and Kittler, 1982; Brodley and Friedl, 1999; Wilson and Martinez, 2000). For example, considerable effort has been devoted to improving the classification performance of the well-known nearest neighbour (NN) rule. Accordingly, this paper addresses the problem of selecting prototypes in order to improve the classification accuracy of an NN classifier. In this context, some approaches to removing outliers from the training set (TS) are introduced here. An