Comparison of six multiclass classifiers by the use of different classification performance indicators

Dániel Szöllősi*, Dénes Lajos Dénes, Ferenc Firtha, Zoltán Kovács and András Fekete

Classification problems are very important, and generally, the question is which is the best model. Several classification performance indicators, including the classification accuracy value (ACC), Cohen's kappa (KAPPA), or the area under the ROC curve (AUC), are used to answer this question. There are also non-parametric comparative methods such as the sum of ranking differences method. The objective of this work was to find the best classification method to classify four soft drink samples and four model samples, which differ from each other only in the sweetener composition. Model samples were used as basic samples for comparison with the commercial soft drinks. Six different classification methods were compared according to their classification performance. A corrected classification accuracy value (corrected ACC), which takes into account the similarities between the classes, was developed and introduced for this purpose. The results showed that the ACC and KAPPA values give similar results in our case. The best three models according to the ACC, KAPPA, and AUC were "K-nearest neighbor", "random forest", and "discriminant analysis". However, the corrected ACC value showed a slightly different ranking, and the random forest model was excluded from the good models. The confusion matrices of the models confirmed the ranking according to the corrected ACC value. The results showed that the best classification model for the available samples was the K-nearest neighbor, and that the corrected ACC value is a useful classification performance indicator. Copyright © 2012 John Wiley & Sons, Ltd.

Keywords: multiclass classifier; classification performance; classification accuracy; Cohen's kappa; AUC; sum of ranking differences

1. INTRODUCTION

There is a definite need for multiclass classification models in the field of data processing and evaluation. Several classification methods are available, such as linear discriminant analysis (DA), K-nearest neighbor (KN), and decision tree (DT). The comparison of different models can be performed on the basis of the classification performance obtained from the validation of the models. Therefore, the validation method is a critical point of model building [1]. Generally, internal validation is used, which is usually not the best method but the only one available. A better validation method is external validation, but unfortunately, two or more independent data sets are rarely available. Regardless of the validation procedure, the model is usually selected according to classification performance indicators. The most widely used indicator is the classification accuracy value (ACC). This parameter is calculated from the confusion matrix and compares the number of correctly classified cases to the number of all cases. The biggest advantage of ACC is its simplicity. However, it also has several drawbacks. When using ACC, we assume that the error made by misclassification is the same in every possible case. Nevertheless, this situation is very rare, and when the assumption fails, the results can be misleading [2,3]. A better indicator can be Cohen's kappa, which takes into account random correct classification and gives a "chance corrected coefficient of agreement" [4].
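To make the two indicators above concrete, the following is a minimal sketch (in Python with NumPy; not code from this paper) of how ACC and Cohen's kappa are obtained from the same confusion matrix. The function name and the three-class example matrix are illustrative assumptions.

```python
import numpy as np

def acc_and_kappa(cm):
    """ACC and Cohen's kappa from a square confusion matrix
    (rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    # ACC: correctly classified cases divided by all cases
    p_observed = np.trace(cm) / n
    # Agreement expected by chance: sum over classes of the
    # product of the row and column marginal proportions
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    return p_observed, kappa

# Hypothetical three-class confusion matrix, for illustration only
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 2, 8]]
acc, kappa = acc_and_kappa(cm)
print(f"ACC = {acc:.3f}, kappa = {kappa:.3f}")  # ACC = 0.767, kappa = 0.650
```

In this example the chance-expected agreement is 1/3, so kappa (0.650) is noticeably lower than ACC (0.767); kappa is always below ACC whenever classification is imperfect and some agreement is expected by chance, which is exactly the correction the indicator is meant to provide.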
If the cost of misclassification is unknown, then the best approach for objective model comparison is the area under the ROC curve, also known as AUC, described by Bradley [5]. The AUC gives an overall evaluation of the classification abilities of the models. However, according to Hand [6], the AUC has some disadvantages. The AUC gives misleading information if the ROC curves cross. Moreover, it is "fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers" [6]. Hand recommended another indicator, called the "H measure", to solve these problems, but unfortunately, it can be used only for two-class classification [6]. Moreover, according to Marzban [7], the AUC value can discriminate well between "good" and "bad" models but hardly between two "good" models. There are also non-parametric model comparison methods. Héberger and Kollár-Hunek [8] reported the sum of ranking differences (SRD) method, which compares the performance of the evaluated models to an ideal model and measures the difference [8]; a numerical sketch of the idea is given after the article information below. The big advantage of the latter method is that it compares the models objectively in a unique and easily traceable way. The only

* Correspondence to: Dániel Szöllősi, Department of Physics and Control, Corvinus University of Budapest, Budapest, Hungary. E-mail: daniel.szollosi@uni-corvinus.hu

D. Szöllősi, D. L. Dénes, F. Firtha, Z. Kovács, A. Fekete: Department of Physics and Control, Corvinus University of Budapest, Budapest, Hungary

Special Issue Article. Received: 15 November 2011; Revised: 2 February 2012; Accepted: 3 February 2012; Published online in Wiley Online Library: 2012 (wileyonlinelibrary.com). DOI: 10.1002/cem.2432. J. Chemometrics 2012; 26: 76–84.
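As an illustration of the SRD method referenced above, the sketch below (Python; our own illustrative code, not the authors') ranks each method's scores over a set of objects and sums the absolute rank differences from a reference ranking. The score matrix and the use of the row-wise average as the reference (a common choice when no gold standard exists) are assumptions for the example; the full SRD procedure of Héberger and Kollár-Hunek also validates the SRD values against rankings of random numbers, which is omitted here.

```python
import numpy as np
from scipy.stats import rankdata  # handles ties with average ranks

def srd(scores, reference):
    """Sum of ranking differences: rank each column (method) and the
    reference vector over the rows (objects), then sum the absolute
    rank differences per column. Smaller SRD = closer to the ideal."""
    ref_ranks = rankdata(reference)
    return np.array([np.abs(rankdata(col) - ref_ranks).sum()
                     for col in np.asarray(scores, dtype=float).T])

# Hypothetical example: 5 objects scored by 3 methods; the reference
# ranking is taken from the row-wise average of the scores.
scores = np.array([[0.90, 0.85, 0.60],
                   [0.80, 0.82, 0.70],
                   [0.70, 0.60, 0.90],
                   [0.60, 0.55, 0.80],
                   [0.50, 0.40, 0.50]])
reference = scores.mean(axis=1)
print(srd(scores, reference))  # [0. 0. 8.]
```

The first two methods order the objects exactly as the reference does (SRD = 0), while the third reorders them and accumulates a rank difference of 8; this single number per method is what makes the comparison unique and easily traceable.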