Information theory approach to learning of the perceptron rule

L. Diambra^{1,*} and J. Fernández^{2}
^{1}Departamento de Fisiologia e Biofísica, ICB, Universidade de São Paulo, CEP 05315-970, São Paulo, São Paulo, Brazil
^{2}Departamento de Física, Universidad Nacional de La Plata, Casilla de Correo 67, 1900 La Plata, Argentina
(Received 31 January 2001; revised manuscript received 17 May 2001; published 20 September 2001)

By recourse to a method based on information theory, we have studied the generalization problem in perceptrons. We considered different a priori distributions for the weights of the teacher perceptron. Our approach allows us to define the information gain from the examples used in the training procedure. The information gain can be used to choose a convenient example set for training the perceptron and to select the transfer function of the student perceptron.

DOI: 10.1103/PhysRevE.64.046106          PACS number(s): 02.50.-r, 05.20.-y, 87.10.+e, 02.70.-c

I. INTRODUCTION

Neural networks exhibit remarkable properties for data processing, having found use in a wide variety of environments such as identification and classification of physical objects, time series processing, and image reconstruction. Given a representative set of examples and an effective learning scheme, such systems can indeed capture the essential relationships and correlations that govern the pertinent class of input-output associations. This is evidenced both by accurate performance on training examples and by reliable generalizations or predictions for novel input patterns. Thus, trained networks are able to produce outputs corresponding to new inputs on the basis of an adequately selected working hypothesis. This working hypothesis is represented by a set of synaptic weights denoted by W*.
Much effort has consequently been devoted to the task of developing suitable training algorithms that are able to adjust the synaptic weights so as to enable the network to infer the correct answer when presented with a new input (see [1,2] for a review). Information theory (IT) [3] has proved to be of utility in devising learning techniques for perceptrons [4,5], and it provides a powerful framework for discussing questions related to the learning process, such as (i) how to incorporate our a priori information about the teacher perceptron (TP); (ii) how to select the appropriate working hypothesis for the student perceptron (SP); and (iii) how to choose convenient examples for the training procedure.

Usually, training schemes are based on gradient descent algorithms on the training energy landscape $E_t$. The training energy is defined by a cost function

$$E_t(W) = \sum_{\mu=1}^{p} \epsilon(W, S^\mu), \qquad (1)$$

where $\epsilon(W, S^\mu)$ is some measure of the deviation and $p$ is the number of examples. This scheme is liable to become trapped in local minima of the energy surface, with subsequent poor generalization performance. In order to avoid this difficulty, a further generalization has been considered through the incorporation of stochastic elements in the dynamics. In this refinement, the space of weights is explored by a stochastic learning process, i.e., a random walk on the training energy landscape [1]. Levin, Tishby, and Solla [6] showed that the stationary distribution of weights $P(W)$ is of Gibbsian character: $P(W) = Z^{-1} \exp[-E_t(W)/T]$.

The training energy is, in most cases, a complicated function of $W$, with multiple valleys and hills. In particular, for perceptrons with binary weights, one encounters regions in the $(p, T)$ plane that contain an enormous number of metastable states as the result of strong frustration, while there is no indication of frustration for the continuous perceptron. Consequently, regarded as a relaxation phenomenon, the training process can be an abnormally slow one [7].
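The stochastic exploration of the weight space described above can be illustrated with a minimal toy sketch: a Metropolis random walk on a training energy of the form of Eq. (1), whose stationary distribution is the Gibbsian $P(W) \propto \exp[-E_t(W)/T]$. This is not the paper's algorithm; the quadratic deviation measure, the sign transfer function, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (our assumption): a teacher with weights W0 and transfer
# function g0 = sign produces p labeled examples for a student with N inputs.
N, p, T = 5, 20, 0.1
W0 = rng.standard_normal(N)
S = rng.standard_normal((p, N))
sigma0 = np.sign(S @ W0)  # teacher outputs

def training_energy(W):
    """E_t(W) = sum_mu eps(W, S^mu), with eps taken as the squared deviation."""
    return np.sum((np.sign(S @ W) - sigma0) ** 2)

def metropolis_step(W, step=0.3):
    """One step of the random walk on the training energy landscape.
    Accepting uphill moves with probability exp(-dE/T) makes the
    stationary distribution Gibbsian: P(W) ~ exp(-E_t(W)/T)."""
    W_new = W + step * rng.standard_normal(N)
    dE = training_energy(W_new) - training_energy(W)
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        return W_new
    return W

W = rng.standard_normal(N)
E_init = training_energy(W)
for _ in range(2000):
    W = metropolis_step(W)
```

At low temperature the walk relaxes toward low-energy weights, but, as the text notes, on a rugged landscape this relaxation can be abnormally slow and can stall in metastable states.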
This, of course, constitutes a serious difficulty if one wishes to optimize the set of weights, because the system can be trapped in a local minimum. We show that these troubles can be avoided by regarding the training process as an inference operation rather than as a relaxation phenomenon. The inference process is to be accomplished according to Occam's razor, i.e., with the minimum number of assumptions compatible with the available data. Thus, the probability distribution is to be obtained by recourse to IT ideas, within the framework of Jaynes' maximum entropy principle (MEP) [8-10]. More specifically, we wish to investigate the probability distribution that ensues in a situation in which each member of the training set is regarded as a constraint for the entropy maximization procedure.

In the present work, the MEP is applied to the training of perceptrons supervised by a TP, with weights $W^0$ and transfer function $g_0$, that provides a set of examples $D_p = \{S^\mu, \sigma_0^\mu\}$, with $\mu = 1, \ldots, p$. We consider here perceptrons with $N$ input units $S_i$ connected to an output unit whose state is determined according to $\sigma = g(S \cdot W)$, where $g(x)$ is the transfer function of the output neuron. For each set of weights $W$, the perceptron maps $S$ onto $\sigma$. In order to select the working hypothesis $W^*$ for the SP, we infer the a posteriori distribution of weights $P(W|D_p)$, and then we adopt as the working hypothesis $W^*$ the configuration of weights that maximizes the a posteriori probability distribution $P(W|D_p)$ (maximum likelihood criterion). The present approach offers an information measure as a bonus. This quantity, named the information gain, is defined from the a posteriori distribution $P(W|D_p)$, which carries information about the example set.

*Electronic address: diambra@fisio.icb.usp.br

PHYSICAL REVIEW E, VOLUME 64, 046106
©2001 The American Physical Society
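The inference view above can be sketched concretely for a binary-weight perceptron: enumerate the candidate weight configurations, weight each by how well it reproduces the example set $D_p$, take the maximizer of the resulting a posteriori distribution as the working hypothesis $W^*$, and measure the information gain as the entropy reduction from prior to posterior. This is a hedged toy illustration, not the paper's derivation; the soft exponential likelihood, the value of $\beta$, and all sizes are our assumptions.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)

# Toy setting (our assumption): teacher with binary weights W0 and g0 = sign
# provides p examples D_p = {S^mu, sigma0^mu}.
N, p, beta = 8, 12, 4.0
W0 = rng.choice([-1.0, 1.0], size=N)
S = rng.standard_normal((p, N))
sigma0 = np.sign(S @ W0)

# Candidate student weights: all 2^N binary configurations (uniform prior).
candidates = np.array(list(product([-1.0, 1.0], repeat=N)))

# Each example acts as a constraint: configurations violating more examples
# receive exponentially less posterior weight (soft likelihood, assumed form).
errors = np.array([np.sum(np.sign(S @ W) != sigma0) for W in candidates])
posterior = np.exp(-beta * errors)
posterior /= posterior.sum()

# Working hypothesis W*: maximizer of P(W | D_p) (maximum likelihood criterion).
W_star = candidates[np.argmax(posterior)]

# Information gain: entropy of the uniform prior minus posterior entropy.
H_prior = N * np.log(2.0)
H_post = -np.sum(posterior * np.log(posterior + 1e-300))
info_gain = H_prior - H_post
```

In this sketch the information gain grows as the examples concentrate the posterior onto fewer configurations, which is what makes it usable as a criterion for choosing a convenient example set.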