Comparison of six multiclass classifiers by the use of different classification performance indicators
Dániel Szöllősi*, Dénes Lajos Dénes, Ferenc Firtha, Zoltán Kovács and András Fekete
Classification problems are very important, and generally, the question is which model is the best. Several classification performance indicators, including the classification accuracy value (ACC), Cohen's kappa (KAPPA), and the area under the ROC curve (AUC), are used to answer this question. There are also non-parametric comparative methods, such as the sum of ranking differences method. The objective of this work was to find the best classification method to classify four soft drink samples and four model samples that differ from each other only in their sweetener composition. The model samples served as reference samples for comparison with the commercial soft drinks. Six different classification methods were compared according to their classification performance. A corrected classification accuracy value (corrected ACC), which takes into account the similarities between the classes, was developed and introduced for this purpose. The results showed that the ACC and KAPPA values gave similar results in our case. The best three models according to ACC, KAPPA, and AUC were “K-nearest neighbor,” “random forest,” and “discriminant analysis.” However, the corrected ACC value yielded a slightly different ranking, in which the random forest model was excluded from the group of good models. The confusion matrices of the models confirmed the ranking according to the corrected ACC value. The results showed that the best classification model for the available samples was the K-nearest neighbor and that the corrected ACC value is a useful classification performance indicator. Copyright © 2012 John Wiley & Sons, Ltd.
Keywords: multiclass classifier; classification performance; classification accuracy; Cohen's kappa; AUC; sum of ranking differences
1. INTRODUCTION
There is a definite need for multiclass classification models in the field of data processing and evaluation. Several classification methods are available, such as linear discriminant analysis (DA), K-nearest neighbor (KN), and decision tree (DT). Different models can be compared on the basis of the classification performance obtained from their validation; therefore, the validation method is a critical point of model building [1]. Generally, internal validation is used; it is usually not the best method, but often the only one available. External validation is better, but two or more independent data sets are rarely available. Regardless of the validation procedure, the model is usually selected according to classification performance indicators. The most widely used indicator is the classification accuracy value (ACC). This parameter is calculated from the confusion matrix and compares the number of correctly classified cases to the number of all cases. The biggest advantage of ACC is its simplicity.
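For reference, in a notation of our own choosing (not spelled out in the paper at this point), let $C = (c_{ij})$ be the confusion matrix of an $m$-class problem, where $c_{ij}$ is the number of cases of class $i$ assigned to class $j$. Then

$$\mathrm{ACC} = \frac{\sum_{i=1}^{m} c_{ii}}{\sum_{i=1}^{m} \sum_{j=1}^{m} c_{ij}},$$

i.e., the fraction of all cases lying on the diagonal of $C$.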
However, ACC also has several drawbacks. When using ACC, we implicitly assume that every misclassification carries the same cost. This is rarely the case, and the results can therefore be misleading [2,3]. A better indicator can be Cohen's kappa, which takes into account correct classifications that occur by chance and gives a “chance corrected coefficient of agreement” [4].
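In the same notation (a standard definition, sketched here for completeness), kappa compares the observed agreement $p_o$ (equal to ACC) with the agreement $p_e$ expected by chance from the row and column totals of the confusion matrix:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{i=1}^{m} \frac{c_{i\cdot}}{N} \cdot \frac{c_{\cdot i}}{N},$$

where $c_{i\cdot}$ and $c_{\cdot i}$ are the row and column sums and $N$ is the total number of cases; $\kappa = 1$ indicates perfect agreement, and $\kappa \approx 0$ indicates agreement no better than chance.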
If the cost of misclassification is unknown, then the best approach for objective model comparison is the area under the ROC curve, also known as AUC, described by Bradley [5]. The AUC gives an overall evaluation of the classification abilities of the models.
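For a two-class problem, the AUC has a standard probabilistic interpretation (not restated in the paper, added here as a reminder): it equals the probability that a randomly chosen case of the positive class receives a higher classifier score than a randomly chosen case of the negative class,

$$\mathrm{AUC} = P\bigl(s(X_{+}) > s(X_{-})\bigr),$$

where $s$ denotes the classifier's scoring function; multiclass AUC values are commonly obtained by averaging such pairwise two-class values.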
However, according to Hand [6], the AUC has some disadvantages. It gives misleading information if the ROC curves cross. Moreover, “it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers” [6].
Hand recommended another indicator called “H measure” to
solve these problems, but unfortunately, it can be used only for
two-class classification [6]. Moreover, according to Marzban [7],
the AUC value can discriminate well between “good” and “bad”
models but hardly between two “good” models. There are also non-parametric model comparison methods. Héberger and Kollár-Hunek [8] reported the sum of ranking differences (SRD) method, which compares the performance of the evaluated models to an ideal model and measures the difference.
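As a brief sketch of the SRD calculation (our summary of [8]): the rows of the data table are ranked within each model's column and within a reference (“ideal”) column, e.g., the row average or a known best value, and the rank differences are summed for each model $j$:

$$\mathrm{SRD}_j = \sum_{i} \bigl| r_{ij} - r_{i}^{\mathrm{ref}} \bigr|,$$

where $r_{ij}$ is the rank of row $i$ in column $j$ and $r_{i}^{\mathrm{ref}}$ is its rank in the reference column; the smaller the SRD value, the closer the model is to the ideal.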
The big advantage of the latter method is that it compares the models objectively in a unique and easily traceable way. The only
* Correspondence to: Dániel Szöllősi, Department of Physics and Control, Corvinus University of Budapest, Budapest, Hungary. E-mail: daniel.szollosi@uni-corvinus.hu

D. Szöllősi, D. L. Dénes, F. Firtha, Z. Kovács, A. Fekete: Department of Physics and Control, Corvinus University of Budapest, Budapest, Hungary
Special Issue Article

Received: 15 November 2011; Revised: 2 February 2012; Accepted: 3 February 2012; Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cem.2432

J. Chemometrics 2012; 26: 76–84