Probabilistic performance evaluation for multiclass classification using the posterior balanced accuracy

Henry Carrillo¹, Kay H. Brodersen², and José A. Castellanos¹

¹ Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, C/ María de Luna 1, 50018, Zaragoza, Spain
{hcarri,jacaste}@unizar.es
² Translational Neuromodeling Unit, Department of Information Technology and Electrical Engineering, Swiss Federal Institute of Technology (ETH Zurich), 8032 Zurich, Switzerland
brodersen@biomed.ee.ethz.ch

Abstract. An important problem in robotics is the empirical evaluation of classification algorithms that allow a robotic system to make accurate categorical predictions about its environment. Current algorithms are often assessed using sample statistics that can be difficult to interpret correctly and do not always provide a principled way of comparing competing algorithms. In this paper, we present a probabilistic alternative based on a Bayesian framework for inference on balanced accuracies. Using the proposed probabilistic evaluation, it is possible to assess the posterior distribution of the balanced accuracy of binary and multiclass classifiers. In addition, competing classifiers can be compared based on their respective posterior distributions. We illustrate the practical utility of our scheme and its properties by reanalyzing the performance of a recently published algorithm in the domain of visual action detection and on synthetic data. To facilitate its use, we provide an open-source MATLAB implementation.

Keywords: multiclass classifiers, accuracy, balanced accuracy, probabilistic performance

1 Introduction

A central theme in the development of intelligent, autonomous robots has been the challenge of decision problems. Typical examples include the critical tasks of object detection [3,14], scene recognition [8,27], active SLAM [10,11], and loop closing [15,17]. All of these domains have seen significant progress in the development of increasingly accurate classification algorithms. By contrast, there has been less focus on evaluating the performance of such algorithms.

Assessing the performance of a given classifier is crucial, as it allows us to (i) obtain an interpretable estimate of the degree to which its results generalize to unseen examples drawn from the same distribution as the existing data, (ii) compare competing approaches, and (iii) tune the (hyper)parameters of a classifier in light of its estimated performance in a given domain.

A common basis for evaluating the performance of a classifier is the confusion matrix. It provides a summary of classification outcomes and permits the inspection of the errors made on each individual class.
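To make this concrete, the following is a minimal sketch in MATLAB (the language of our released implementation, though this snippet is an illustrative simplification rather than the toolbox itself; the confusion-matrix counts are synthetic). Under independent flat Beta(1,1) priors, the accuracy of each class has a Beta posterior given the counts in the corresponding row of the confusion matrix, and the posterior of the balanced accuracy, i.e., the average of the per-class accuracies, can be approximated by Monte Carlo sampling:

    % Minimal sketch: Monte Carlo approximation of the posterior balanced
    % accuracy from a confusion matrix C, where C(i,j) counts examples of
    % true class i predicted as class j. Counts below are synthetic.
    % (betarnd and quantile require the Statistics Toolbox.)
    C = [40 5; 10 45];     % hypothetical 2-class confusion matrix
    k = diag(C);           % correctly classified examples per class
    n = sum(C, 2);         % total examples per class
    S = 1e5;               % number of posterior samples
    % With a flat Beta(1,1) prior, class i's accuracy has a
    % Beta(k(i)+1, n(i)-k(i)+1) posterior; draw S samples per class.
    A = betarnd(repmat((k + 1)', S, 1), repmat((n - k + 1)', S, 1));
    bacc = mean(A, 2);     % samples from the balanced-accuracy posterior
    fprintf('posterior mean %.3f, 95%% interval [%.3f, %.3f]\n', ...
        mean(bacc), quantile(bacc, 0.025), quantile(bacc, 0.975));

The same samples also support comparisons between classifiers: given draws bacc1 and bacc2 from the respective posteriors of two competing algorithms, mean(bacc1 > bacc2) approximates the posterior probability that the first classifier attains the higher balanced accuracy.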