1204 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002
Any Reasonable Cost Function Can Be Used for A Posteriori
Probability Approximation
Marco Saerens, Patrice Latinne, and Christine Decaestecker
Abstract—In this paper, we provide a straightforward proof
of an important, but nevertheless little known, result obtained
by Lindley in the framework of subjective probability theory.
This result, once interpreted in the machine learning/pattern
recognition context, sheds new light on the probabilistic
interpretation of the output of a trained classifier. A learning machine,
or more generally a model, is usually trained by minimizing a
criterion—the expectation of the cost function—measuring the
discrepancy between the model output and the desired output. In
this letter, we first show that, for the binary classification case,
training the model with any “reasonable cost function” can lead to
Bayesian a posteriori probability estimation. Indeed, after having
trained the model by minimizing the criterion, there always exists
a computable transformation that maps the output of the model to
the Bayesian a posteriori probability of the class membership given
the input. Then, necessary conditions allowing the computation
of the transformation mapping the outputs of the model to the
a posteriori probabilities are derived for the multioutput case.
Finally, these theoretical results are illustrated through some
simulation examples involving various cost functions.
Index Terms—A posteriori probabilities, Bayes decision making,
cost function, loss function, training criterion.
I. INTRODUCTION
An important problem concerns the probabilistic interpretation
to be given to the output of a learning machine,
or more generally a model, after training. It appears that this
probabilistic interpretation depends on the cost function used
for training. Classification models are almost always trained by
minimizing a given criterion, the expectation of the cost func-
tion. It is therefore of fundamental importance to have a precise
idea of what can be achieved with the choice of this criterion.
Consequently, there has been considerable interest in ana-
lyzing the properties of the mean square error criterion—the
most commonly used criterion. It is well known, for instance,
that artificial neural nets (or more generally any model), when
trained using the mean square error criterion, produce as output
an approximation of the expected value of the desired output
Manuscript received January 8, 2001; revised September 21, 2001 and
February 13, 2002. This work was supported in part by the “Région de
Bruxelles-Capitale” under Project RBC-BR 216/4041 and by the SmalS-MvM. The
work of P. Latinne was supported by an Action de Recherche Concertée (ARC)
program of the Communauté Française de Belgique.
M. Saerens is with IRIDIA Laboratory (Artificial Intelligence Labora-
tory), Université Libre de Bruxelles, B-1050 Brussels, Belgium and is also
with SmalS-MvM, Research Section, B-1050 Brussels, Belgium (e-mail:
saerens@ulb.ac.be).
P. Latinne is with IRIDIA Laboratory (Artificial Intelligence Labora-
tory), Université Libre de Bruxelles, B-1050 Brussels, Belgium (e-mail:
platinne@ulb.ac.be).
C. Decaestecker is with the Belgian National Fund for Scientific Research
(F.N.R.S.) at the Laboratory of Histopathology, Université Libre de Bruxelles,
B-1070 Brussels, Belgium (e-mail: cdecaes@ulb.ac.be).
Publisher Item Identifier S 1045-9227(02)05572-8.
conditional on the explanatory input variables if “perfect
training” is achieved (see, for instance, [1] and [5]). We say
that perfect training is achieved if
• a minimum of the criterion is indeed reached after training;
• the learning machine is a “sufficiently powerful model”
that is able to approximate the optimal estimator to any
degree of accuracy (perfect model matching property).
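To make this property concrete, here is a small numerical sketch (our own illustration, not taken from the paper): for a binary 0/1 desired output with an assumed a posteriori probability P(t = 1 | x) = 0.7 at a fixed input, the constant output that minimizes the empirical mean squared error is the sample mean of the targets, i.e., an estimate of the a posteriori probability.

```python
import random

random.seed(0)
p_true = 0.7  # assumed a posteriori probability P(t = 1 | x) at a fixed input x
targets = [1 if random.random() < p_true else 0 for _ in range(2000)]

def mse(y):
    """Empirical mean squared error when the model constantly outputs y."""
    return sum((t - y) ** 2 for t in targets) / len(targets)

# "Perfect training" by brute force: grid-search the output minimizing the
# criterion. The minimizer of a quadratic in y is the sample mean of the
# targets, which converges to P(t = 1 | x) as the sample grows.
candidates = [i / 1000 for i in range(1001)]
y_star = min(candidates, key=mse)
```

Because the empirical mean squared error is quadratic in the output, its minimizer is exactly the sample mean of the 0/1 targets, so `y_star` approaches 0.7 as the sample size grows.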
It has also been shown that other cost functions, for instance
the cross-entropy between the desired output and the model
output in the case of pattern classification, lead to the same
property of approximating the conditional expectation of the
desired output. We may, therefore, wonder what conditions a
cost function should satisfy in order that the model output has
this property. In 1991, following the results of Hampshire and
Pearlmutter [3], Miller et al. [7], [8] answered this question
by providing conditions on the cost function ensuring that the
output of the model approximates the conditional expectation of
the desired output given the input, in the case of perfect training.
These results were rederived by Saerens using the calculus of
variations [9], and were then extended to the conditional median
[10]. In [10], a close relationship was also pointed out between
the conditions on the cost function ensuring that the model
output approximates the conditional probability of the desired
output given the input when the criterion is minimized, and the
quasilikelihood functions used in applied statistics (generalized
linear models; see [6]).
In this work, we focus on classification, in which case the
model will be called a classifier. In this framework, we show
that, for the binary classification case, training the classifier with
any reasonable cost function leads to a posteriori probability es-
timation. Indeed, after having trained the model by minimizing
the criterion, there always exists a computable transformation
that maps the output of the model to the a posteriori probability
of the class label. This means that we are free to choose any
reasonable cost function we want, and train the classifier with
it. We can always remap the output of the model afterwards to
the a posteriori probability, for Bayesian decision making. We
will see that this property generalizes to a certain extent to the
multioutput case.
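As a hypothetical worked example of such a remapping (our own sketch; the cost function and symbols are illustrative, not from the paper), take the quartic cost C(y, t) = (y − t)^4 with binary targets. Setting the derivative of the expected cost to zero gives p(1 − y)^3 = (1 − p)y^3, so the optimal output y* is biased away from the posterior p; but solving that equation for p yields the closed-form transformation p = y^3 / (y^3 + (1 − y)^3), which recovers the posterior from the trained output:

```python
p = 0.8  # assumed a posteriori probability P(class = 1 | x)

def expected_cost(y):
    """Expected quartic cost E[(y - t)^4] for a binary t with P(t = 1) = p."""
    return p * (y - 1) ** 4 + (1 - p) * y ** 4

# The "perfectly trained" output: minimize the criterion by fine grid search.
ys = [i / 100000 for i in range(100001)]
y_star = min(ys, key=expected_cost)  # about 0.614, not the posterior 0.8

def remap(y):
    """Transformation mapping the optimal output back to the posterior,
    obtained by solving p * (1 - y)**3 == (1 - p) * y**3 for p."""
    return y ** 3 / (y ** 3 + (1 - y) ** 3)
```

Here `remap(y_star)` returns approximately 0.8: a classifier trained with this “reasonable” cost does not output the posterior directly, but a computable transformation recovers it, as the result discussed above guarantees.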
This important result was proved by Lindley in 1982, in the
context of subjective probability theory [4]. Briefly, Lindley
considered the case where a person expresses his uncertainty
about an event, conditional upon another event, by assigning it
a number (we use Lindley’s notations). For example, consider
a physician who, after the medical examination of a patient, has
to express his uncertainty about the diagnosis of a given disease,
conditional on the result of the examination. This person then
receives a score which is a function of the assigned number and
of the truth or falsity of the event when the conditioning event
is true [where is an indicator
1045-9227/02$17.00 © 2002 IEEE