Information theory approach to learning of the perceptron rule
L. Diambra¹,* and J. Fernández²
¹Departamento de Fisiologia e Biofísica, ICB, Universidade de São Paulo, CEP 05315-970, São Paulo, São Paulo, Brazil
²Departamento de Física, Universidad Nacional de La Plata, casilla de correo 67, 1900 La Plata, Argentina
*Electronic address: diambra@fisio.icb.usp.br
(Received 31 January 2001; revised manuscript received 17 May 2001; published 20 September 2001)
By recourse to a method based on information theory, we have studied the generalization problem in
perceptrons. We considered different a priori distributions about the weights of the teacher perceptron. Our
approach allows us to define the information gain from the examples used in the training procedure. The
information gain can be used to choose a convenient example set for training the perceptron and to select the
transfer function of the student perceptron.
DOI: 10.1103/PhysRevE.64.046106    PACS number(s): 02.50.-r, 05.20.-y, 87.10.+e, 02.70.-c
I. INTRODUCTION
Neural networks exhibit remarkable properties for data
processing, having found use in a wide variety of environ-
ments such as identification and classification of physical
objects, time series processing, and image reconstruction.
Given a representative set of examples, with an effective
learning scheme, such systems can indeed capture the essen-
tial relationships and correlations that govern the pertinent
class of input-output associations. This is evidenced both by
accurate performance on training examples and by reliable
generalizations or predictions for novel input patterns. Thus,
trained networks are able to produce outputs corresponding
to new inputs on the basis of an adequately selected working
hypothesis. This working hypothesis is represented by a set
of synaptic weights denoted by W*. Much effort has conse-
quently been devoted to the task of developing suitable train-
ing algorithms that are able to adjust the synaptic weights so
as to enable the network to infer the correct answer when
presented with a new input (see [1,2] for a review).
Information theory (IT) [3] has proved to be of utility in
devising learning techniques for perceptrons [4,5], and pro-
vides a powerful framework for discussing questions related
to the learning process, such as (i) how to incorporate our a
priori information about the teacher perceptron (TP); (ii)
how to select the appropriate working hypothesis for the stu-
dent perceptron (SP); and (iii) how to choose convenient
examples for the training procedure.
Usually, training schemes are based on gradient descent
algorithms on the training energy landscape $E_t$. The training
energy is defined by a cost function

$$E_t(\mathbf{W}) = \sum_{\mu=1}^{p} \epsilon(\mathbf{W}, \mathbf{S}^{\mu}), \qquad (1)$$

where $\epsilon(\mathbf{W}, \mathbf{S}^{\mu})$ is some measure of the deviation and $p$ is
the number of examples. This scheme is liable to become
trapped in local minima of the energy surface with subse-
quent poor generalization performance. In order to avoid this
difficulty, a further generalization has been considered
through incorporation of stochastic elements in the dynam-
ics. In this refinement, the space of weights is explored by a
stochastic learning process, i.e., a random walk on the train-
ing energy landscape [1]. Levin, Tishby, and Solla [6]
showed that the stationary distribution of weights $P(\mathbf{W})$ is of
Gibbsian character: $P(\mathbf{W}) = Z^{-1}\exp[-E_t(\mathbf{W})/T]$.
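As an illustration of this stochastic scheme, the following minimal sketch (not taken from the paper; the function names, the tanh transfer function, and all parameter values are illustrative assumptions) performs a Metropolis random walk on the training energy of a continuous student perceptron supervised by a teacher. By the usual detailed-balance argument, its stationary weight distribution is the Gibbs form quoted above.

    import numpy as np

    rng = np.random.default_rng(0)

    def training_energy(W, S, sigma0):
        # E_t(W) = sum_mu eps(W, S^mu), with eps taken here as the squared output error
        out = np.tanh(S @ W)                  # student outputs g(S . W), with g = tanh (illustrative)
        return np.sum((out - sigma0) ** 2)

    def metropolis_step(W, S, sigma0, T=0.1, step=0.05):
        # one random-walk move on the energy landscape, accepted with Gibbs weights
        W_new = W + step * rng.standard_normal(W.shape)
        dE = training_energy(W_new, S, sigma0) - training_energy(W, S, sigma0)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            return W_new
        return W

    N, p = 20, 100                            # illustrative sizes
    W0 = rng.standard_normal(N)               # teacher weights W^0
    S = rng.standard_normal((p, N))           # input patterns S^mu
    sigma0 = np.tanh(S @ W0)                  # teacher outputs sigma_0^mu

    W = rng.standard_normal(N)                # student weights, random start
    for _ in range(20000):                    # long run samples from Z^-1 exp(-E_t/T)
        W = metropolis_step(W, S, sigma0)

At low temperature the walk concentrates on low-energy weights, but, as discussed next, it may take an abnormally long time to escape metastable states.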
The training energy is, in most cases, a complicated func-
tion of W, with multiple valleys and hills. In particular, for
perceptrons with binary weights, one encounters regions in
the $(p,T)$ plane that contain an enormous number of meta-
stable states as the result of strong frustration, while there is
no indication of frustration for the continuous perceptron.
Consequently, regarded as a relaxation phenomenon, the
training process can be an abnormally slow one [7]. This, of
course, constitutes a serious difficulty if one wishes to opti-
mize the set of weights because the system can be trapped in
a local minimum. We show that these troubles can be
avoided by regarding the training process as an inference
operation rather than as a relaxation phenomenon. The infer-
ence process is to be accomplished according to Occam’s
razor, i.e., with the minimum number of assumptions com-
patible with the available data. Thus, the probability distri-
bution is to be obtained by recourse to IT ideas, within the
framework of Jaynes’ maximum entropy principle (MEP)
[8–10]. More specifically, we wish to investigate the prob-
ability distribution that ensues in a situation in which each
ability distribution that ensues in a situation in which each
member of the training set is regarded as a constraint for the
entropy maximization procedure.
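For reference, the generic structure of such a MEP inference reads as follows; this is only a schematic, in which $A_\mu$ and $a_\mu$ stand for whatever constraint function and prescribed value each example imposes, and are placeholders rather than the specific choices made in this paper:

$$\max_{P}\;\Big[-\!\int d\mathbf{W}\, P(\mathbf{W})\ln P(\mathbf{W})\Big]
\quad\text{subject to}\quad
\int d\mathbf{W}\, P(\mathbf{W})\, A_\mu(\mathbf{W}) = a_\mu,\qquad \mu = 1,\dots,p,$$

together with normalization, whose variational solution is the exponential form

$$P(\mathbf{W}) = \frac{1}{Z}\,\exp\!\Big[-\sum_{\mu=1}^{p}\lambda_\mu A_\mu(\mathbf{W})\Big],$$

with the Lagrange multipliers $\lambda_\mu$ fixed by the $p$ constraints. When a nonuniform a priori distribution over the weights is available, the same construction applies with the entropy replaced by the relative entropy with respect to that prior.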
In the present work, the MEP is applied to the training of
perceptrons supervised by a TP, with weights $\mathbf{W}^0$ and transfer
function $g_0$, that provides a set of examples $D_p = \{\mathbf{S}^{\mu}, \sigma_0^{\mu}\}$,
with $\mu = 1, \dots, p$. We consider here perceptrons with $N$ in-
put units $S_i$ connected to an output unit whose state $\sigma$ is
determined according to $\sigma = g(\mathbf{S}\cdot\mathbf{W})$, where $g(x)$ is the
transfer function of the output neuron. For each set of
weights $\mathbf{W}$ the perceptron maps $\mathbf{S}$ on $\sigma$. In order to select the
working hypothesis $\mathbf{W}^*$ for the SP, we infer the a posteriori
distribution of weights $P(\mathbf{W}|D_p)$, and then we adopt as the
working hypothesis $\mathbf{W}^*$ the configuration of weights that
maximizes the a posteriori probability distribution $P(\mathbf{W}|D_p)$
(maximum likelihood criterion). The present approach offers
an information measure as a bonus. This quantity, named the
information gain, is defined from the a posteriori distribution
$P(\mathbf{W}|D_p)$, which carries information about the example set
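A minimal numerical illustration of this selection step follows. The assumptions here are not the paper's: binary teacher weights, a uniform prior, and an ad hoc logistic likelihood per example, with the inverse noise level beta and all sizes chosen arbitrarily. The paper's actual posterior follows from the MEP, and its information gain is defined from $P(\mathbf{W}|D_p)$; the prior-to-posterior entropy reduction computed below is only a conventional stand-in for such a measure.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)

    N, p, beta = 8, 15, 2.0                       # illustrative sizes and inverse noise level
    W0 = rng.choice([-1.0, 1.0], size=N)          # binary teacher weights W^0
    S = rng.standard_normal((p, N))               # input patterns S^mu
    sigma0 = np.sign(S @ W0)                      # teacher outputs with g_0 = sign

    # enumerate all binary weight vectors as candidate hypotheses
    candidates = np.array(list(product([-1.0, 1.0], repeat=N)))

    # uniform prior; accumulate an illustrative logistic log-likelihood for each example
    log_post = np.zeros(len(candidates))
    for mu in range(p):
        margins = sigma0[mu] * (candidates @ S[mu])
        log_post += np.log(1.0 / (1.0 + np.exp(-beta * margins)))

    post = np.exp(log_post - log_post.max())
    post /= post.sum()                            # a posteriori distribution P(W | D_p)

    W_star = candidates[np.argmax(post)]          # working hypothesis: maximizer of the posterior

    # entropy reduction from the uniform prior to the posterior, in bits
    H_prior = N                                   # log2(2^N)
    H_post = -np.sum(post * np.log2(post + 1e-300))
    info_gain = H_prior - H_post
    print(f"overlap with teacher: {np.dot(W_star, W0) / N:+.2f}, gain: {info_gain:.2f} bits")

Adding examples sharpens the posterior around the teacher, so the overlap of $\mathbf{W}^*$ with $\mathbf{W}^0$ grows along with the entropy reduction; this is the sense in which an example set can be ranked by how much information it conveys.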