Comparison of six multiclass classifiers by the use of different classification performance indicators
Dániel Szöllősi*, Dénes Lajos Dénes, Ferenc Firtha, Zoltán Kovács and András Fekete
Classification problems are very important, and generally, the question is which model is the best. Several classification performance indicators, including the classification accuracy value (ACC), Cohen's kappa (KAPPA), and the area under the ROC curve (AUC), are used to answer this question. There are also non-parametric comparative methods, such as the sum of ranking differences method. The objective of this work was to find the best classification method to classify four soft drink samples and four model samples that differ from each other only in their sweetener composition. The model samples served as reference samples for comparison with the commercial soft drinks. Six different classification methods were compared according to their classification performance. A corrected classification accuracy value (corrected ACC), which takes into account the similarities between the classes, was developed and introduced for this purpose. The results showed that the ACC and KAPPA values gave similar results in our case. The best three models according to ACC, KAPPA, and AUC were “K-nearest neighbor,” “random forest,” and “discriminant analysis.” However, the corrected ACC value yielded a slightly different ranking, in which the random forest model was excluded from the group of good models. The confusion matrices of the models confirmed the ranking according to the corrected ACC value. The results showed that the best classification model for the available samples was the K-nearest neighbor and that the corrected ACC value is a useful classification performance indicator. Copyright © 2012 John Wiley & Sons, Ltd.
Keywords: multiclass classifier; classification performance; classification accuracy; Cohen's kappa; AUC; sum of ranking differences
1. INTRODUCTION
There is a definite need for multiclass classification models in the field of data processing and evaluation. Several classification methods are available, such as linear discriminant analysis (DA), K-nearest neighbor (KN), and decision tree (DT). Different models can be compared on the basis of the classification performance obtained from their validation; therefore, the validation method is a critical point of model building [1]. Generally, internal validation is used; it is usually not the best method, but often the only one available. External validation is better, but two or more independent data sets are rarely available. Regardless of the validation procedure, the model is usually selected according to classification performance indicators. The most widely used indicator is the classification accuracy value (ACC). This parameter is calculated from the confusion matrix and compares the number of correctly classified cases to the number of all cases. The biggest advantage of ACC is its simplicity.
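For reference, in a notation of our own choosing (not spelled out in the paper at this point), let $C = (c_{ij})$ be the confusion matrix of an $m$-class problem, where $c_{ij}$ is the number of cases of class $i$ assigned to class $j$. Then

$$\mathrm{ACC} = \frac{\sum_{i=1}^{m} c_{ii}}{\sum_{i=1}^{m} \sum_{j=1}^{m} c_{ij}},$$

i.e., the fraction of all cases lying on the diagonal of $C$.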
However, ACC also has several drawbacks. When using ACC, we implicitly assume that every misclassification carries the same cost. This is rarely the case, and the results can therefore be misleading [2,3]. A better indicator can be Cohen's kappa, which takes into account correct classifications that occur by chance and gives a “chance corrected coefficient of agreement” [4].
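In the same notation (a standard definition, sketched here for completeness), kappa compares the observed agreement $p_o$ (equal to ACC) with the agreement $p_e$ expected by chance from the row and column totals of the confusion matrix:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{i=1}^{m} \frac{c_{i\cdot}}{N} \cdot \frac{c_{\cdot i}}{N},$$

where $c_{i\cdot}$ and $c_{\cdot i}$ are the row and column sums and $N$ is the total number of cases; $\kappa = 1$ indicates perfect agreement, and $\kappa \approx 0$ indicates agreement no better than chance.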
If the cost of misclassification is unknown, then the best approach for objective model comparison is the area under the ROC curve, also known as AUC, described by Bradley [5]. The AUC gives an overall evaluation of the classification abilities of the models.
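For a two-class problem, the AUC has a standard probabilistic interpretation (not restated in the paper, added here as a reminder): it equals the probability that a randomly chosen case of the positive class receives a higher classifier score than a randomly chosen case of the negative class,

$$\mathrm{AUC} = P\bigl(s(X_{+}) > s(X_{-})\bigr),$$

where $s$ denotes the classifier's scoring function; multiclass AUC values are commonly obtained by averaging such pairwise two-class values.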
However, according to Hand [6], the AUC has some disadvantages. It gives misleading information if the ROC curves cross. Moreover, “it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers” [6].
Hand recommended another indicator called “H measure” to
solve these problems, but unfortunately, it can be used only for
two-class classification [6]. Moreover, according to Marzban [7],
the AUC value can discriminate well between “good” and “bad”
models but hardly between two “good” models. There are also non-parametric model comparison methods. Héberger and Kollár-Hunek [8] reported the sum of ranking differences (SRD) method, which compares the performance of the evaluated models to an ideal model and measures the difference.
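As a brief sketch of the SRD calculation (our summary of [8]): the rows of the data table are ranked within each model's column and within a reference (“ideal”) column, e.g., the row average or a known best value, and the rank differences are summed for each model $j$:

$$\mathrm{SRD}_j = \sum_{i} \bigl| r_{ij} - r_{i}^{\mathrm{ref}} \bigr|,$$

where $r_{ij}$ is the rank of row $i$ in column $j$ and $r_{i}^{\mathrm{ref}}$ is its rank in the reference column; the smaller the SRD value, the closer the model is to the ideal.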
The big advantage of the latter method is that it compares the models objectively in a unique and easily traceable way. The only
* Correspondence to: Dániel Szöllősi, Department of Physics and Control, Corvinus University of Budapest, Budapest, Hungary. E-mail: daniel.szollosi@uni-corvinus.hu

D. Szöllősi, D. L. Dénes, F. Firtha, Z. Kovács, A. Fekete: Department of Physics and Control, Corvinus University of Budapest, Budapest, Hungary
Special Issue Article

Received: 15 November 2011; Revised: 2 February 2012; Accepted: 3 February 2012; Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cem.2432

J. Chemometrics 2012; 26: 76–84