Neural network training without spurious minima
L. Diambra* and A. Plastino²
Physics Department, National University La Plata, Casilla de Correo 727, 1900 La Plata, Argentina
(Received 2 January 1996)
In training student perceptrons, recourse to information-theory concepts allows one to select the best working hypothesis and to obtain an exact solution for the associated probability distribution. We apply this training scheme to perceptrons with binary weights and show that no phase transition ensues. With our approach fast learning is guaranteed and trapping by spurious local minima is avoided. [S1063-651X(96)07505-8]
PACS number(s): 87.10.+e, 05.20.-y, 02.70.-c
I. INTRODUCTION
Neural networks have been proposed as models for many
cognitive functions: associative memory, generalization, cat-
egorization, etc. These functions appear as epiphenomena of
the emergent collective behavior of the interconnected neural
system. A well-documented situation is that of systems able
to ‘‘learn’’ from examples. Great progress has been made by
recourse to techniques of statistical mechanics in analyzing
the performance of a student perceptron (SP) trained by a
teacher perceptron (TP) [1–3].
Generalization is a characteristic ability of feedforward
networks, the perceptron in particular. They exhibit infer-
ence capacities, i.e., they can produce outputs corresponding
to new inputs not previously presented by the TP, on the basis
of an adequately selected working hypothesis (WH). This
hypothesis is, of course, represented by a set of synaptic
weights $W_i$ that, when appropriately implemented, yields
good generalization performance. Much effort has conse-
quently been devoted to the task of developing suitable train-
ing algorithms able to adjust the synaptic weights so as to
enable the network to infer the correct answer when
presented with a new input.
In the present effort, a recently introduced [4,5] maximum
entropy method is applied to perceptrons with binary weights
[6,7]. We consider here perceptrons with $N$ input units $S_i$
connected to an output unit whose state is determined ac-
cording to $\sigma = g(\mathbf{S}\cdot\mathbf{W})$, where $g(x)$ is the invertible transfer
function of the output neuron. We assume that the network
space is restricted to vectors that satisfy the normalization
$\sum_i W_i^2 = N$. For each set of weights $\mathbf{W}$ the perceptron maps
$\mathbf{S}$ onto $\sigma$. In order to select the WH for the SP, we infer the
TP state from the training set $\{\mathbf{S}^\mu, \sigma_0^\mu\}$, with $\mu = 1,\dots,p$,
provided by a TP with weights $\mathbf{W}_0$ and transfer function
$g_0$ (our available information).
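The teacher-student setup just described can be sketched numerically. In this minimal illustration the binary weights $W_i = \pm 1$ automatically satisfy $\sum_i W_i^2 = N$; the choice of $\tanh$ as the transfer function $g$ is an assumption for concreteness, since the paper leaves $g$ generic.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10          # number of input units
p = 20          # number of training examples

# Teacher perceptron with binary weights W0_i = +/-1, which
# automatically satisfy the normalization sum_i W_i^2 = N.
W0 = rng.choice([-1.0, 1.0], size=N)

def g(x):
    """Invertible transfer function of the output neuron (tanh, assumed)."""
    return np.tanh(x)

# Training set {S^mu, sigma0^mu}, mu = 1..p, provided by the teacher.
S = rng.choice([-1.0, 1.0], size=(p, N))
sigma0 = g(S @ W0)

# A candidate student weight vector W maps each input S^mu onto g(S^mu . W).
W = rng.choice([-1.0, 1.0], size=N)
sigma = g(S @ W)
```

The pairs `(S, sigma0)` constitute the available information from which the student's working hypothesis is to be inferred.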
The usual training schemes are stochastic processes that
can be viewed as a random walk on the training-energy land-
scape. The training energy is defined by
$$E_t(\mathbf{W}) = \sum_{\mu=1}^{p} \epsilon(\mathbf{W},\mathbf{S}^\mu), \qquad (1)$$
where $\epsilon(\mathbf{W},\mathbf{S}^\mu)$ is some measure of the deviation of the SP
answer $g(\mathbf{S}^\mu\cdot\mathbf{W})$ from the TP one, represented by $g_0(\mathbf{S}^\mu\cdot\mathbf{W}_0)$.
Levin, Tishby, and Solla [8] have shown that the stationary
distribution of weights $P(\mathbf{W})$ is of a Gibbsian character:
$P(\mathbf{W}) = Z^{-1}\exp[-E_t(\mathbf{W})/T]$. The training energy is, in most cases, a
complicated function of $\mathbf{W}$, with multiple valleys and hills.
In the $(p,T)$ plane one encounters regions that contain an
enormous number of metastable states as the result of a
strong frustration [1]. The time required in order to sur-
mount the free-energy barrier is of the order of $t \sim e^{N\Delta f/T}$,
where $\Delta f$ is the height of the free-energy barrier. Consequently,
regarded as a relaxation phenomenon, the training process
can be an abnormally slow one [9]. This, of course, consti-
tutes a serious difficulty if one wishes to optimize the set of
weights: the system can be trapped in a local minimum, with
a subsequent poor generalization performance.
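For small $N$ the Gibbsian stationary distribution can be computed exactly by enumerating all $2^N$ binary weight vectors, which makes the energy landscape of Eq. (1) concrete. The quadratic deviation measure $\epsilon$ and the $\tanh$ transfer function are illustrative assumptions; the paper leaves both generic.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

N, p, T = 8, 12, 0.5

g = np.tanh                              # assumed transfer function (same for TP and SP)
W0 = rng.choice([-1.0, 1.0], size=N)     # teacher weights
S = rng.choice([-1.0, 1.0], size=(p, N))
sigma0 = g(S @ W0)

def E_t(W):
    """Training energy, Eq. (1), with an assumed quadratic deviation measure."""
    return np.sum((g(S @ W) - sigma0) ** 2)

# Exact stationary (Gibbsian) distribution P(W) = Z^{-1} exp[-E_t(W)/T]
# over all 2^N binary weight vectors (feasible only for small N).
configs = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
energies = np.array([E_t(W) for W in configs])
boltzmann = np.exp(-energies / T)
P = boltzmann / boltzmann.sum()
```

The teacher configuration reproduces `sigma0` exactly, so its training energy vanishes and it carries the largest Gibbs weight; a stochastic walk on this landscape, however, may visit many higher-lying metastable configurations before reaching it.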
Here we intend to show that these troubles can be avoided
by recourse to information-theory (IT) ideas [10,11], which
have proved to be of utility in devising learning schemes [4].
In the present effort, the training process will be regarded not
as a relaxation phenomenon but rather as an inference opera-
tion. One wishes to infer the $\mathbf{W}$ state of the SP from the
information conveyed by the training set. Our specific sug-
gestion is that of adopting as a WH the configuration of
weights that maximizes the entropy associated with the con-
comitant probability distribution (PD). This PD, in turn, is to
be obtained by recourse to IT ideas, within the framework of
Jaynes' maximum entropy principle (MEP) [11]. More spe-
cifically, we wish to investigate the probability distribution
that ensues when each member of the training set is regarded
as a constraint for the entropy-maximization procedure [4].
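The inference viewpoint can be illustrated in a toy setting. Taking a sign transfer function (an assumption; the MEP construction of Sec. II handles general $g$), each training example becomes a hard constraint on $\mathbf{W}$, and the maximum-entropy distribution subject to those constraints alone is uniform over the consistent configurations:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

N, p = 9, 6                         # N odd so that S . W never vanishes
W0 = rng.choice([-1, 1], size=N)
S = rng.choice([-1, 1], size=(p, N))
sigma0 = np.sign(S @ W0)            # sign transfer function for this toy case

# Treat each training example as a constraint: keep only the binary weight
# vectors consistent with every (S^mu, sigma0^mu) pair.  The maximum-entropy
# distribution subject to these constraints alone is uniform over this set,
# and each new example can only shrink it.
configs = np.array(list(itertools.product([-1, 1], repeat=N)))
consistent = configs[np.all(np.sign(configs @ S.T) == sigma0, axis=1)]
P = np.full(len(consistent), 1.0 / len(consistent))   # uniform maxent PD
```

No energy landscape is ever descended here: the distribution is obtained directly from the constraints, which is the sense in which training becomes an inference operation rather than a relaxation process.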
The paper is organized as follows: in Sec. II we review
the MEP method for obtaining the associated probability
distribution, and the a priori probability distribution is intro-
duced. In Sec. III we examine two different a priori prob-
ability distributions, and an interesting limiting case is analyzed.
The generalization performance is discussed in Sec. IV, and
some conclusions are drawn in Sec. V.
II. THE MEP APPROACH
In IT parlance, a given fixed set of observables, referred
to as the ‘‘relevant’’ ones in order to build up the pertinent
statistical operator, constitutes the so-called observation level
[12]. In dealing with neural networks, one can use the infor-
*Electronic address: diambra@venus.fisica.unlp.edu.ar
²Electronic address: plastino@venus.fisica.unlp.edu.ar
PHYSICAL REVIEW E, VOLUME 53, NUMBER 5, MAY 1996
1063-651X/96/53(5)/5190(4)/$10.00    5190    © 1996 The American Physical Society