IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 6, NO. 4, JULY 1995 837

Learning in Linear Neural Networks: A Survey

Pierre F. Baldi and Kurt Hornik, Member, IEEE

Abstract—Networks of linear units are the simplest kind of networks, where the basic questions related to learning, generalization, and self-organization can sometimes be answered analytically. We survey most of the known results on linear networks, including: 1) backpropagation learning and the structure of the error function landscape, 2) the temporal evolution of generalization, and 3) unsupervised learning algorithms and their properties. The connections to classical statistical ideas, such as principal component analysis (PCA), are emphasized, as well as several simple but challenging open questions. A few new results are also spread across the paper, including an analysis of the effect of noise on backpropagation networks and a unified view of all unsupervised algorithms.

I. INTRODUCTION

This paper addresses the problems of supervised and unsupervised learning in layered networks of linear units and, together with a few new results, reviews most of the recent literature on the subject. One may expect the topic to be fairly restricted, yet it is in fact quite rich and far from being exhausted. Since the first approximations of biological neurons using threshold gates [1], the nonlinear aspects of neural computations and hardware have often been emphasized, and linear networks dismissed as uninteresting for being able to express only linear input-output maps. Furthermore, multiple layers of linear units can always be collapsed by multiplying the corresponding weight matrices. So why bother?
Nonlinear computations are obviously extremely important, but these arguments should be regarded with suspicion: by stressing the input-output relations only, they miss the subtle problems of dynamics, structure, and organization that normally arise during learning and plasticity, even in simple linear systems. There are other reasons why linear networks deserve careful attention. General results in the nonlinear case are often absent or difficult to derive analytically, whereas the linear case can often be analyzed in mathematical detail. As in the theory of differential equations, the linear setting should be regarded as the first simple case to be studied. More complex situations can often be investigated by linearization, although this has not been attempted systematically in neural networks, for instance in the analysis of backpropagation learning. In backpropagation, learning is often started with zero or small random initial weights and biases. Thus, at least during the initial phase of training, the network is operating in its linear regime. Even when training is completed, one often finds several units in the network which are operating in their linear range.

Manuscript received September 25, 1992; revised June 19, 1994. This work was supported in part by grants from NSF, AFOSR, and ONR.
P. F. Baldi is with the Jet Propulsion Laboratory and the Division of Biology, California Institute of Technology, Pasadena, CA 91109 USA.
K. Hornik is with the Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, A-1040 Vienna, Austria.
IEEE Log Number 9409158.

Fig. 1. The basic network in the autoassociative case (m = n): n input units, p hidden units, and n output units, with weight matrices A (input to hidden) and B (hidden to output).
From the standpoint of theoretical biology, it has been argued that certain classes of neurons may be operating most of the time in a linear or quasi-linear regime, and linear input-output relations seem to hold for certain specific biological circuits (see [2] for an example). Finally, the study of linear networks leads to new interesting questions, insights, and paradigms which could not have been guessed in advance, and to new ways of looking at certain classical statistical techniques. To begin with, we shall consider a linear network with an n-p-m architecture comprising one input layer, one hidden layer, and one output layer with n, p, and m units, respectively (Fig. 1). The more general case, with, for instance, multiple hidden layers, can be reduced to this simple setting, as we shall see. A will usually denote the p x n matrix connecting the input to the middle layer and B the m x p matrix of connection weights from the middle layer to the output. Thus, for instance, bij represents the strength of the coupling between the jth hidden unit and the ith output unit (double indexes are always in the post-presynaptic order). The network therefore computes the linear function y = BAx. In the usual learning-from-examples setting, we assume that a set of n-dimensional input patterns xt (1 ≤ t ≤ T) is given, together with a corresponding set of m-dimensional target output patterns yt (1 ≤ t ≤ T) (all vectors are assumed to be column vectors). X = [x1, ..., xT] and Y = [y1, ..., yT] are the n x T and m x T matrices having the patterns as their columns. Because of the need for target outputs, this form of learning will also be called supervised. For simplicity, unless otherwise stated, all the patterns are assumed to be centered (i.e., (x) = (y) = 0). The symbol "(.)" will be used for averages over the set of patterns or, sometimes, over the pattern distribution, depending on the context.
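The setup above is easy to make concrete. The following NumPy sketch (not from the paper; the dimensions n = 4, p = 2, m = 4 and the random weights are illustrative choices) builds the n-p-m linear network, applies it to a matrix X of centered patterns, and verifies the collapsing argument mentioned earlier: the two layers are equivalent to the single weight matrix W = BA, whose rank is at most p.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n inputs, p hidden units, m outputs, T patterns.
n, p, m, T = 4, 2, 4, 50

A = rng.standard_normal((p, n))   # p x n weights, input layer -> hidden layer
B = rng.standard_normal((m, p))   # m x p weights, hidden layer -> output layer

# T input patterns as the columns of the n x T matrix X, centered so <x> = 0.
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)

# The network computes y = B A x; applied to all patterns at once:
Y = B @ (A @ X)

# Collapsing the layers: the same map is given by the single matrix W = BA,
# so the composite network can only realize linear maps of rank at most p.
W = B @ A
assert np.allclose(Y, W @ X)
assert np.linalg.matrix_rank(W) <= p
```

The rank constraint is the one feature of the two-layer architecture that survives the collapse: with p < min(n, m), the hidden layer acts as a bottleneck, which is what connects this setting to PCA in the autoassociative case.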
The approximation of one by the other is a central problem in statistics, but is not our main concern here. The environment is supposed to be stationary but the results could be extended to a slowly varying environment to deal with plasticity issues. Throughout this paper, learning will often be based on the minimization of an error function E depending