Probabilistic performance evaluation for multiclass classification using the posterior balanced accuracy

Henry Carrillo¹, Kay H. Brodersen², and José A. Castellanos¹

¹ Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, C/ María de Luna 1, 50018, Zaragoza, Spain
{hcarri,jacaste}@unizar.es
² Translational Neuromodeling Unit, Department of Information Technology and Electrical Engineering, Swiss Federal Institute of Technology (ETH Zurich), 8032 Zurich, Switzerland
brodersen@biomed.ee.ethz.ch

Abstract. An important problem in robotics is the empirical evaluation of classification algorithms that allow a robotic system to make accurate categorical predictions about its environment. Current algorithms are often assessed using sample statistics that can be difficult to interpret correctly and do not always provide a principled way of comparing competing algorithms. In this paper, we present a probabilistic alternative based on a Bayesian framework for inference on balanced accuracies. Using the proposed probabilistic evaluation, it is possible to assess the posterior distribution of the balanced accuracy of binary and multiclass classifiers. In addition, competing classifiers can be compared based on their respective posterior distributions. We illustrate the practical utility of our scheme and its properties by reanalyzing the performance of a recently published algorithm in the domain of visual action detection and on synthetic data. To facilitate its use, we provide an open-source MATLAB implementation.

Keywords: multiclass classifiers, accuracy, balanced accuracy, probabilistic performance

1 Introduction

A central theme in the development of intelligent, autonomous robots has been the challenge of decision problems. Typical examples include the critical tasks of object detection [3,14], scene recognition [8,27], active SLAM [10,11], and loop closing [15,17]. All of these domains have seen significant progress in the development of increasingly accurate classification algorithms. By contrast, there has been less focus on evaluating the performance of such algorithms.

Assessing the performance of a given classifier is crucial, as it allows us to (i) obtain an interpretable estimate of the degree to which its results generalize to unseen examples drawn from the same distribution as the existing data, (ii) compare competing approaches, and (iii) tune the (hyper)parameters of a classifier in light of its estimated performance in a given domain.

A common basis for evaluating the performance of a classifier is the confusion matrix. It provides a summary of classification outcomes and permits the inspection of the errors made on each individual class.
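To make this concrete, the following is a minimal sketch in MATLAB (the language of our released implementation, though this snippet is an illustrative simplification rather than the toolbox itself; the confusion-matrix counts are synthetic). Under independent flat Beta(1,1) priors, the accuracy of each class has a Beta posterior given the counts in the corresponding row of the confusion matrix, and the posterior of the balanced accuracy, i.e., the average of the per-class accuracies, can be approximated by Monte Carlo sampling:

    % Minimal sketch: Monte Carlo approximation of the posterior balanced
    % accuracy from a confusion matrix C, where C(i,j) counts examples of
    % true class i predicted as class j. Counts below are synthetic.
    % (betarnd and quantile require the Statistics Toolbox.)
    C = [40 5; 10 45];     % hypothetical 2-class confusion matrix
    k = diag(C);           % correctly classified examples per class
    n = sum(C, 2);         % total examples per class
    S = 1e5;               % number of posterior samples
    % With a flat Beta(1,1) prior, class i's accuracy has a
    % Beta(k(i)+1, n(i)-k(i)+1) posterior; draw S samples per class.
    A = betarnd(repmat((k + 1)', S, 1), repmat((n - k + 1)', S, 1));
    bacc = mean(A, 2);     % samples from the balanced-accuracy posterior
    fprintf('posterior mean %.3f, 95%% interval [%.3f, %.3f]\n', ...
        mean(bacc), quantile(bacc, 0.025), quantile(bacc, 0.975));

The same samples also support comparisons between classifiers: given draws bacc1 and bacc2 from the respective posteriors of two competing algorithms, mean(bacc1 > bacc2) approximates the posterior probability that the first classifier attains the higher balanced accuracy.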