MD-ELM: Originally Mislabeled Samples Detection using OP-ELM Model

Anton Akusok a,*, David Veganzones b, Yoan Miche c,f, Kaj-Mikael Björk d, Philippe du Jardin e, Eric Severin b, Amaury Lendasse a,d

a Department of Mechanical and Industrial Engineering, Iowa Informatics Initiative, The University of Iowa, Iowa City, IA 52242-1527, USA
b University of Lille 1, IAE, 104 avenue du peuple Belge, 59043 Lille, France
c Department of Information and Computer Science, Aalto University School of Science, FI-00076, Finland
d Arcada University of Applied Sciences, Helsinki, Finland
e EDHEC Business School, BP3116, 06202 Nice cedex 3, France
f Nokia Solutions and Networks Group, Espoo, Finland
* Corresponding author.

Article history: Received 28 September 2014; Received in revised form 28 December 2014; Accepted 27 January 2015. Communicated by G.-B. Huang.

Keywords: Mislabels; Extreme Learning Machine; Classification

Abstract

This paper proposes a methodology for identifying data samples that are likely to be mislabeled in a c-class classification problem (dataset). The methodology relies on the assumption that the generalization error of a model learned from the data decreases if the label of a mislabeled sample is changed to its correct class. The general classification model used in the paper is OP-ELM; it also provides a fast way to estimate the generalization error by PRESS Leave-One-Out. The methodology is tested on two toy datasets, as well as on real-life datasets, for one of which expert knowledge about the identified potential mislabels has been sought.

© 2015 Elsevier B.V. All rights reserved.
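The PRESS Leave-One-Out estimate mentioned above has a closed form for any linear output layer, which is what makes it fast for ELM-type models. The following minimal Python sketch illustrates that computation; the function name `press_loo_mse`, the regularization parameter, and the variable names are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def press_loo_mse(H, T, reg=1e-8):
    """Closed-form PRESS Leave-One-Out MSE for a linear output layer,
    applied on top of an ELM hidden-layer output matrix H (n x L)
    with targets T (n x c). A small ridge term 'reg' is assumed
    for numerical stability."""
    # Output weights: B = (H'H + reg*I)^-1 H'T
    HtH = H.T @ H + reg * np.eye(H.shape[1])
    B = np.linalg.solve(HtH, H.T @ T)
    # Diagonal of the hat matrix: h_ii = h_i (H'H + reg*I)^-1 h_i'
    C = np.linalg.inv(HtH)
    hat_diag = np.einsum('ij,jk,ik->i', H, C, H)
    # PRESS: leave-one-out residuals are e_i / (1 - h_ii)
    residuals = T - H @ B
    loo_residuals = residuals / (1.0 - hat_diag)[:, None]
    return np.mean(loo_residuals ** 2)
```

Because the hat-matrix diagonal is reused across all samples, the whole leave-one-out error costs little more than a single model fit.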
1. Introduction

This work focuses on finding data samples with incorrect labels in a given dataset. Such samples create label noise, which is generally considered more harmful than feature noise [1]. The work is motivated by studies on a financial dataset [2] where each sample corresponds to a company labeled as either healthy or bankrupt. In this dataset, samples with incorrect labels are important both by themselves (i.e. as companies eligible for a loan but mislabeled as bankrupt), and for the dataset as a whole, since correcting them allows building more precise machine learning models with a limited amount of data (each sample is expensive and slow to obtain). There are other areas where the detection of particular mislabeled samples is important, such as medical applications [3].

There exist multiple sources of label noise. First, noise can be generated by simple mistakes in data gathering and processing, such as typing errors or sensor malfunction [4,1]. For real datasets, such noise is estimated at roughly 5%, not including other factors [5]. Second, the experts who label the data can make mistakes. This happens especially in cases where labeling quality is traded for a lower labeling price, for instance with crowdsourcing [6] such as the Amazon Mechanical Turk framework [7]. Third, the labeling criterion may be vague, in which case different experts will produce different labels. For example, in EEG segmentation the exact beginnings and ends of signals are not formally defined, and different doctors give slightly different signal boundaries [8]. Finally, the existing information may be insufficient for reliable labeling of the data [4].

Recent methods of machine learning in the presence of mislabeled data can be grouped into three categories [9]. Data cleansing (or filtering) methods [4] pre-process the dataset and fix incorrect labels or remove the affected samples [10]; the resulting clean dataset is then used with general machine learning methods. Noise-robust methods [11,12], such as k-nearest neighbors [13], are tuned to perform well despite the presence of label noise. It is even possible to achieve the same theoretical performance with label noise as without it, although only in simple cases [14]. Noise-tolerant methods include label noise in their model; an extensive overview of such methods is presented in [9], Section 7. A good survey is given by Frénay in his PhD thesis [15].

The idea behind detecting mislabeled samples is to exploit their effect of increasing model complexity [4,16]. In a Single-Layer Feed-forward Neural network (SLFN), learning a more complex dataset requires more hidden neurons [24]; equivalently, the same number of hidden neurons yields lower accuracy on a more complex dataset. Correcting an incorrect sample label will therefore decrease the error of a fixed SLFN, as sketched below. Note the difference between mislabeled samples and outliers: an outlier is not necessarily a mislabeled sample.
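The detection idea of the preceding paragraph can be made concrete with a short sketch: flip the label of each candidate sample in turn and check whether the estimated generalization error of a fixed SLFN decreases. This is only an illustration of the underlying principle, not the MD-ELM algorithm itself; it assumes the hypothetical `press_loo_mse` helper from the earlier snippet and binary one-hot targets.

```python
import numpy as np

def mislabel_scores(H, T):
    """Score each sample by how much flipping its label decreases the
    PRESS Leave-One-Out error of a fixed SLFN (hidden-layer output H,
    binary one-hot targets T). Higher score = more likely mislabeled."""
    base_error = press_loo_mse(H, T)
    scores = np.zeros(H.shape[0])
    for i in range(H.shape[0]):
        T_flipped = T.copy()
        T_flipped[i] = T[i][::-1]  # swap the two class indicators
        # Positive score: flipping sample i reduced the estimated
        # generalization error, so its original label is suspect.
        scores[i] = base_error - press_loo_mse(H, T_flipped)
    return scores
```

Samples with the largest positive scores are the potential mislabels; a practical method must also account for the randomness of the hidden layer, which the paper addresses with the OP-ELM model.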