Neural network training without spurious minima
L. Diambra* and A. Plastino²
Physics Department, National University La Plata, Casilla de Correo 727, 1900 La Plata, Argentina
(Received 2 January 1996)
In training student perceptrons, recourse to information-theory concepts allows one to select the best working hypothesis and to obtain an exact solution for the associated probability distribution. We apply this training scheme to perceptrons with binary weights and show that no phase transition ensues. With our approach fast learning is guaranteed and trapping by spurious local minima is avoided. [S1063-651X(96)07505-8]
PACS number(s): 87.10.+e, 05.20.-y, 02.70.-c
I. INTRODUCTION
Neural networks have been proposed as models for many
cognitive functions: associative memory, generalization, cat-
egorization, etc. These functions appear as epiphenomena of
the emergent collective behavior of the interconnected neural
system. A well-documented situation is that of systems able
to ‘‘learn’’ from examples. Great progress has been made by
recourse to techniques of statistical mechanics in analyzing
the performance of a student perceptron (SP) trained by a
teacher perceptron (TP) [1–3].
Generalization is a characteristic ability of feedforward
networks, the perceptron in particular. They exhibit infer-
ence capacities, i.e., they can produce outputs corresponding
to new inputs not previously presented by the TP, on the basis
of an adequately selected working hypothesis (WH). This
hypothesis is, of course, represented by a set of synaptic
weights $W_i$ that, when appropriately implemented, yields
good generalization performance. Much effort has conse-
quently been devoted to the task of developing suitable train-
ing algorithms able to adjust the synaptic weights so as to
enable the network to infer the correct answer when
presented with a new input.
In the present effort, a recently introduced [4,5] maximum
entropy method is applied to perceptrons with binary weights
[6,7]. We consider here perceptrons with $N$ input units $S_i$
connected to an output unit whose state is determined ac-
cording to $\sigma = g(\mathbf{S}\cdot\mathbf{W})$, where $g(x)$ is the invertible transfer
function of the output neuron. We assume that the network
space is restricted to vectors that satisfy the normalization
$\sum_i W_i^2 = N$. For each set of weights $\mathbf{W}$ the perceptron maps
$\mathbf{S}$ onto $\sigma$. In order to select the WH for the SP, we infer the
TP state from the training set $\{\mathbf{S}^\mu, \sigma_0^\mu\}$, with $\mu = 1,\dots,p$,
provided by a TP with weights $\mathbf{W}_0$ and transfer function
$g_0$ (our available information).
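The teacher-student setup just described can be sketched numerically. In this minimal illustration the binary weights $W_i = \pm 1$ automatically satisfy $\sum_i W_i^2 = N$; the choice of $\tanh$ as the transfer function $g$ is an assumption for concreteness, since the paper leaves $g$ generic.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10          # number of input units
p = 20          # number of training examples

# Teacher perceptron with binary weights W0_i = +/-1, which
# automatically satisfy the normalization sum_i W_i^2 = N.
W0 = rng.choice([-1.0, 1.0], size=N)

def g(x):
    """Invertible transfer function of the output neuron (tanh, assumed)."""
    return np.tanh(x)

# Training set {S^mu, sigma0^mu}, mu = 1..p, provided by the teacher.
S = rng.choice([-1.0, 1.0], size=(p, N))
sigma0 = g(S @ W0)

# A candidate student weight vector W maps each input S^mu onto g(S^mu . W).
W = rng.choice([-1.0, 1.0], size=N)
sigma = g(S @ W)
```

The pairs `(S, sigma0)` constitute the available information from which the student's working hypothesis is to be inferred.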
The usual training schemes are stochastic processes that
can be viewed as a random walk on the training-energy land-
scape. The training energy is defined by
$$E_t(\mathbf{W}) = \sum_{\mu=1}^{p} \epsilon(\mathbf{W},\mathbf{S}^\mu), \qquad (1)$$
where $\epsilon(\mathbf{W},\mathbf{S}^\mu)$ is some measure of the deviation of the SP
answer $g(\mathbf{S}^\mu\cdot\mathbf{W})$ from the TP one, represented by $g_0(\mathbf{S}^\mu\cdot\mathbf{W}_0)$.
Levin, Tishby, and Solla [8] have shown that the stationary
distribution of weights $P(\mathbf{W})$ is of a Gibbsian character:
$P(\mathbf{W}) = Z^{-1}\exp[-E_t(\mathbf{W})/T]$. The training energy is, in most cases, a
complicated function of $\mathbf{W}$, with multiple valleys and hills.
In the $(p,T)$ plane one encounters regions that contain an
enormous number of metastable states as the result of a
strong frustration [1]. The time required in order to sur-
mount the free-energy barrier is of the order of $t \sim e^{N\Delta f/T}$,
where $\Delta f$ is the height of the free-energy barrier. Consequently,
regarded as a relaxation phenomenon, the training process
can be an abnormally slow one [9]. This, of course, consti-
tutes a serious difficulty if one wishes to optimize the set of
weights: the system can be trapped in a local minimum, with
a subsequent poor generalization performance.
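For small $N$ the Gibbsian stationary distribution can be computed exactly by enumerating all $2^N$ binary weight vectors, which makes the energy landscape of Eq. (1) concrete. The quadratic deviation measure $\epsilon$ and the $\tanh$ transfer function are illustrative assumptions; the paper leaves both generic.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

N, p, T = 8, 12, 0.5

g = np.tanh                              # assumed transfer function (same for TP and SP)
W0 = rng.choice([-1.0, 1.0], size=N)     # teacher weights
S = rng.choice([-1.0, 1.0], size=(p, N))
sigma0 = g(S @ W0)

def E_t(W):
    """Training energy, Eq. (1), with an assumed quadratic deviation measure."""
    return np.sum((g(S @ W) - sigma0) ** 2)

# Exact stationary (Gibbsian) distribution P(W) = Z^{-1} exp[-E_t(W)/T]
# over all 2^N binary weight vectors (feasible only for small N).
configs = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
energies = np.array([E_t(W) for W in configs])
boltzmann = np.exp(-energies / T)
P = boltzmann / boltzmann.sum()
```

The teacher configuration reproduces `sigma0` exactly, so its training energy vanishes and it carries the largest Gibbs weight; a stochastic walk on this landscape, however, may visit many higher-lying metastable configurations before reaching it.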
Here we intend to show that these troubles can be avoided
by recourse to information-theory (IT) ideas [10,11], which
have proved to be of utility in devising learning schemes [4].
In the present effort, the training process will be regarded not
as a relaxation phenomenon but rather as an inference opera-
tion. One wishes to infer the $\mathbf{W}$ state of the SP from the
information conveyed by the training set. Our specific sug-
gestion is that of adopting as a WH the configuration of
weights that maximizes the entropy associated with the con-
comitant probability distribution (PD). This PD, in turn, is to
be obtained by recourse to IT ideas, within the framework of
Jaynes' maximum entropy principle (MEP) [11]. More spe-
cifically, we wish to investigate the probability distribution
that ensues when each member of the training set is regarded
as a constraint for the entropy-maximization procedure [4].
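The inference viewpoint can be illustrated in a toy setting. Taking a sign transfer function (an assumption; the MEP construction of Sec. II handles general $g$), each training example becomes a hard constraint on $\mathbf{W}$, and the maximum-entropy distribution subject to those constraints alone is uniform over the consistent configurations:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

N, p = 9, 6                         # N odd so that S . W never vanishes
W0 = rng.choice([-1, 1], size=N)
S = rng.choice([-1, 1], size=(p, N))
sigma0 = np.sign(S @ W0)            # sign transfer function for this toy case

# Treat each training example as a constraint: keep only the binary weight
# vectors consistent with every (S^mu, sigma0^mu) pair.  The maximum-entropy
# distribution subject to these constraints alone is uniform over this set,
# and each new example can only shrink it.
configs = np.array(list(itertools.product([-1, 1], repeat=N)))
consistent = configs[np.all(np.sign(configs @ S.T) == sigma0, axis=1)]
P = np.full(len(consistent), 1.0 / len(consistent))   # uniform maxent PD
```

No energy landscape is ever descended here: the distribution is obtained directly from the constraints, which is the sense in which training becomes an inference operation rather than a relaxation process.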
The paper is organized as follows: in Sec. II we review
the MEP method for obtaining the associated probability
distribution, and the a priori probability distribution is intro-
duced. In Sec. III we examine two different a priori prob-
ability distributions, and an interesting limiting case is analyzed.
The generalization performance is discussed in Sec. IV, and
some conclusions are drawn in Sec. V.
II. THE MEP APPROACH
In IT parlance, a given fixed set of observables, referred
to as the ‘‘relevant’’ ones in order to build up the pertinent
statistical operator, constitutes the so-called observation level
[12]. In dealing with neural networks, one can use the infor-
*Electronic address: diambra@venus.fisica.unlp.edu.ar
²Electronic address: plastino@venus.fisica.unlp.edu.ar
PHYSICAL REVIEW E, VOLUME 53, NUMBER 5, MAY 1996
1063-651X/96/53(5)/5190(4)/$10.00    5190    © 1996 The American Physical Society