1204 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002

Any Reasonable Cost Function Can Be Used for A Posteriori Probability Approximation

Marco Saerens, Patrice Latinne, and Christine Decaestecker

Abstract—In this paper, we provide a straightforward proof of an important, but nevertheless little known, result obtained by Lindley in the framework of subjective probability theory. This result, once interpreted in the machine learning/pattern recognition context, sheds new light on the probabilistic interpretation of the output of a trained classifier. A learning machine, or more generally a model, is usually trained by minimizing a criterion—the expectation of the cost function—measuring the discrepancy between the model output and the desired output. In this letter, we first show that, for the binary classification case, training the model with any "reasonable cost function" can lead to Bayesian a posteriori probability estimation. Indeed, after having trained the model by minimizing the criterion, there always exists a computable transformation that maps the output of the model to the Bayesian a posteriori probability of the class membership given the input. Then, necessary conditions allowing the computation of the transformation mapping the outputs of the model to the a posteriori probabilities are derived for the multioutput case. Finally, these theoretical results are illustrated through some simulation examples involving various cost functions.

Index Terms—A posteriori probabilities, Bayes decision making, cost function, loss function, training criterion.

I. INTRODUCTION

An important problem concerns the probabilistic interpretation to be given to the output of a learning machine, or more generally a model, after training. It appears that this probabilistic interpretation depends on the cost function used for training.
Manuscript received January 8, 2001; revised September 21, 2001 and February 13, 2002. This work was supported in part by the "Région de Bruxelles-Capitale" under Project RBC-BR 216/4041 and by SmalS-MvM. The work of P. Latinne was supported by an Action de Recherche Concertée (ARC) program of the Communauté Française de Belgique.

M. Saerens is with the IRIDIA Laboratory (Artificial Intelligence Laboratory), Université Libre de Bruxelles, B-1050 Brussels, Belgium, and is also with SmalS-MvM, Research Section, B-1050 Brussels, Belgium (e-mail: saerens@ulb.ac.be). P. Latinne is with the IRIDIA Laboratory (Artificial Intelligence Laboratory), Université Libre de Bruxelles, B-1050 Brussels, Belgium (e-mail: platinne@ulb.ac.be). C. Decaestecker is with the Belgian Research Funds (F.N.R.S.) at the Laboratory of Histopathology, Université Libre de Bruxelles, B-1070 Brussels, Belgium (e-mail: cdecaes@ulb.ac.be).

Publisher Item Identifier S 1045-9227(02)05572-8.

Classification models are almost always trained by minimizing a given criterion, the expectation of the cost function. It is therefore of fundamental importance to have a precise idea of what can be achieved with the choice of this criterion. Consequently, there has been considerable interest in analyzing the properties of the mean square error criterion—the most commonly used criterion. It is well known, for instance, that artificial neural nets (or more generally any model), when trained using the mean square error criterion, produce as output an approximation of the expected value of the desired output conditional on the explanatory input variables if "perfect training" is achieved (see, for instance, [1] and [5]). We say that perfect training is achieved if
• a minimum of the criterion is indeed reached after training;
• the learning machine is a "sufficiently powerful model" that is able to approximate the optimal estimator to any degree of accuracy (perfect model matching property).
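This mean-square-error property can be checked numerically. The sketch below is illustrative only (the data distribution, probabilities, and grid search are our own assumptions, not a simulation from this letter): it fits one free output value per value of a binary input and shows that minimizing the mean square error drives the output toward the conditional mean E[d | x] = P(d = 1 | x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: binary input x, binary desired output d with
# P(d = 1 | x = 0) = 0.2 and P(d = 1 | x = 1) = 0.8.
p_true = {0: 0.2, 1: 0.8}
x = rng.integers(0, 2, size=100_000)
d = (rng.random(x.size) < np.where(x == 0, p_true[0], p_true[1])).astype(float)

# On a binary input, a "sufficiently powerful model" amounts to one free
# output value y per input value.  For each x, scan a grid of candidate
# outputs and keep the one minimizing the mean square error; it lands on
# the conditional mean of d, i.e., the a posteriori probability.
ys = np.linspace(0.0, 1.0, 1001)
y_star = {}
for xv in (0, 1):
    dv = d[x == xv]
    mse = (dv**2).mean() - 2.0 * ys * dv.mean() + ys**2  # E[(d - y)^2]
    y_star[xv] = ys[np.argmin(mse)]

print(y_star)  # both outputs close to p_true
```

The expansion E[(d - y)^2] = E[d^2] - 2y E[d] + y^2 is used so the grid search runs in one vectorized pass instead of forming a large samples-by-grid matrix.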
It has also been shown that other cost functions, for instance the cross-entropy between the desired output and the model output in the case of pattern classification, lead to the same property of approximating the conditional expectation of the desired output. We may, therefore, wonder what conditions a cost function should satisfy in order that the model output has this property. In 1991, following the results of Hampshire and Pearlmutter [3], Miller et al. [7], [8] answered this question by providing conditions on the cost function ensuring that the output of the model approximates the conditional expectation of the desired output given the input, in the case of perfect training. These results were rederived by Saerens using the calculus of variations [9], and were then extended to the conditional median [10]. Also in [10], a close relationship was pointed out between the conditions on the cost function ensuring that the output of the model approximates the conditional probability of the desired output given the input, when the performance criterion is minimized, and the quasilikelihood functions used in applied statistics (generalized linear models; see [6]).

In this work, we focus on classification, in which case the model will be called a classifier. In this framework, we show that, for the binary classification case, training the classifier with any reasonable cost function leads to a posteriori probability estimation. Indeed, after having trained the model by minimizing the criterion, there always exists a computable transformation that maps the output of the model to the a posteriori probability of the class label. This means that we are free to choose any reasonable cost function we want, and train the classifier with it. We can always remap the output of the model afterwards to the a posteriori probability, for Bayesian decision making.
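This remapping claim can be illustrated directly. In the sketch below we pick an arbitrary cost C(y, d) = (y - d)^4 — our own choice for illustration, not one advocated in this letter. The output minimizing the expected cost turns out to be a monotone, hence invertible, function of p = P(d = 1 | x), so the a posteriori probability can always be recovered from the trained output by a fixed transformation.

```python
import numpy as np

# Binary case: the desired output d is 0 or 1, and p = P(d = 1 | x) is
# the a posteriori probability.  For a cost function C(y, d), the
# optimal model output minimizes the expected cost
#     R(y) = p * C(y, 1) + (1 - p) * C(y, 0).
# As an arbitrary "reasonable" cost (assumed here purely for
# illustration), take C(y, d) = (y - d)**4 and trace the minimizer y*(p).
ys = np.linspace(0.0, 1.0, 2001)
ps = np.linspace(0.01, 0.99, 99)
y_star = np.array(
    [ys[np.argmin(p * (ys - 1.0) ** 4 + (1.0 - p) * ys**4)] for p in ps]
)

# The map p -> y* is monotone (analytically, setting R'(y) = 0 gives
# y* = 1 / (1 + ((1 - p) / p) ** (1 / 3))), hence invertible: the
# posterior is recovered from the trained output by interpolation.
assert np.all(np.diff(y_star) >= 0.0)

# Example: a trained model outputs 0.3; remap it to the posterior.
p_recovered = float(np.interp(0.3, y_star, ps))
print(p_recovered)
```

Note that `np.interp` requires the interpolation abscissae to be nondecreasing, which is exactly what the monotonicity assertion verifies; a non-invertible y*(p) would signal a cost function outside the "reasonable" class.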
We will see that this property generalizes, to a certain extent, to the multioutput case. This important result was proved by Lindley in 1982, in the context of subjective probability theory [4]. Briefly, Lindley considered the case where a person expresses his uncertainty about an event E, conditional upon an event F, by assigning a number x (we use Lindley's notations). For example, consider a physician who, after the medical examination of a patient, has to express his uncertainty about the diagnosis of a given disease E, conditional on the result F of the examination. This person then receives a score which is a function of x and the truth or falsity of E when F is true [where is an indicator