AN IMPROVEMENT TO THE NATURAL GRADIENT LEARNING ALGORITHM FOR MULTILAYER PERCEPTRONS

Michael R. Bastian, Jacob H. Gunther and Todd K. Moon
Utah State University
Department of Electrical and Computer Engineering
4120 Old Main Hill, Logan, UT 84322-4120

Thanks to Anteon Corporation for funding this research.

ABSTRACT

Natural gradient learning has been shown to avoid singularities in the parameter space of multilayer perceptrons. However, it requires a large number of additional parameters beyond those of ordinary backpropagation. This article describes a new approach to natural gradient learning in which the number of parameters needed is much smaller than in the natural gradient algorithm. This new method exploits the algebraic structure of the parameter space to reduce the space and time complexity of the algorithm and to improve its performance.

1. INTRODUCTION

Amari and his colleagues have developed natural gradient learning for multilayer perceptrons [1, 2, 3]. Instead of following the steepest-descent direction, it uses a Quasi-Newton method [4] in which the Riemannian metric tensor of the underlying parameter space serves as the approximation to the Hessian. In the case of multilayer perceptrons, this metric tensor is the Fisher Information matrix evaluated at the current parameter. Since the Fisher Information matrix is the expected value of the Hessian matrix, it fits very nicely into a Quasi-Newton optimization framework.

However, the problem with natural gradient learning is that the Fisher Information matrix must be inverted. Also, for large networks, the algorithm becomes computationally intractable because of the large number of additional parameters in the Fisher Information matrix that must be maintained during training.

This problem can be mitigated by a new formulation of the Fisher Information matrix for multilayer perceptrons, which this article describes. By choosing an inner product for the parameter space and the norm it induces, a Sobolev gradient [5] may be formulated such that the Fisher Information matrix has much smaller dimensions than in the Adaptive Natural Gradient algorithm [2]. As a result, the learning algorithm performs better and is more robust to initial values and noise.

2. A NEW PARAMETERIZATION

Let the random variable $x$ be the feature map (input) of a multilayer perceptron with parameter $\theta$. Let the function $f(x;\theta)$ be the output of the perceptron when given $x$ as the input. Let the random variable $t$ be the desired output of the perceptron and the random variable

\[
  e = t - f(x;\theta) \tag{1}
\]

be the output error of the perceptron. Let $\theta^\ast$ be the optimal perceptron, i.e., the parameter for which the expected output error is minimized.

2.1. Mapping to an Anti-Diagonal Block-Matrix

A multilayer perceptron with $L$ layers is parameterized by $L$ matrices $W_1,\ldots,W_L$; the matrix $W_k$ represents the connection weights of the $k$-th layer. Let $\Phi$ be a mapping of the matrices into a block matrix such that $W_1,\ldots,W_L$ lie on the anti-diagonal of the block matrix:

\[
  \Phi(W_1,\ldots,W_L) =
  \begin{bmatrix}
    0      & \cdots  & 0       & W_L    \\
    \vdots &         & W_{L-1} & 0      \\
    0      & \iddots &         & \vdots \\
    W_1    & 0       & \cdots  & 0
  \end{bmatrix} \tag{2}
\]

This structure is similar in form to the weighted adjacency matrix of an $(L+1)$-partite graph [6].

2.2. The Inner Product of Two Perceptrons

An inner product of two perceptrons $\theta_1$ and $\theta_2$ is given in (3).
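To make the Quasi-Newton interpretation in Section 1 concrete, the following sketch shows a single natural-gradient update in which an estimate of the Fisher Information matrix plays the role of the Hessian approximation. It is a minimal illustration only: the function name, the empirical-Fisher estimate from per-sample gradients, and the damping term are choices made here for clarity and do not reproduce the Adaptive Natural Gradient algorithm of [2].

```python
import numpy as np

def natural_gradient_step(theta, grad, per_sample_grads, lr=0.01, damping=1e-4):
    """One update: theta <- theta - lr * F^{-1} grad.

    F is estimated as the average outer product of per-sample gradients
    (an empirical Fisher estimate); a small damping term keeps the
    linear solve well conditioned.
    """
    G = np.atleast_2d(per_sample_grads)          # shape (num_samples, num_params)
    fisher = G.T @ G / G.shape[0]                # empirical Fisher Information estimate
    fisher += damping * np.eye(fisher.shape[0])  # regularize before solving
    step = np.linalg.solve(fisher, grad)         # F^{-1} grad without forming the inverse
    return theta - lr * step
```

Ordinary backpropagation would simply return theta - lr * grad; the additional F^{-1} factor is what rescales the step according to the geometry of the parameter space, at the cost of maintaining and inverting the Fisher Information matrix.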
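The mapping of Eq. (2) in Section 2.1 can be illustrated directly. The sketch below builds a dense matrix whose anti-diagonal blocks are the layer weight matrices; the convention that $W_L$ occupies the upper-right block and $W_1$ the lower-left follows the form of Eq. (2) shown above, and the function name is chosen here for illustration.

```python
import numpy as np

def antidiagonal_block_map(weights):
    """Arrange the layer weight matrices on the anti-diagonal of a block matrix.

    weights: list [W_1, ..., W_L], where W_k holds the connection weights
    of layer k. The (L+1-k, k) block of the result is W_k and all other
    blocks are zero, so W_L sits in the upper-right and W_1 in the lower-left.
    """
    L = len(weights)
    row_sizes = [weights[L - 1 - i].shape[0] for i in range(L)]  # rows of W_L, ..., W_1
    col_sizes = [W.shape[1] for W in weights]                    # cols of W_1, ..., W_L
    B = np.zeros((sum(row_sizes), sum(col_sizes)))
    row_off = np.concatenate(([0], np.cumsum(row_sizes)))
    col_off = np.concatenate(([0], np.cumsum(col_sizes)))
    for k, W in enumerate(weights):          # k = 0 corresponds to W_1
        i = L - 1 - k                        # anti-diagonal row-block index
        B[row_off[i]:row_off[i] + W.shape[0],
          col_off[k]:col_off[k] + W.shape[1]] = W
    return B
```

For example, a two-layer perceptron with W_1 of size 3x4 and W_2 of size 2x3 maps to a 5x7 matrix with W_2 in the upper-right block, W_1 in the lower-left block, and zeros elsewhere, mirroring the anti-diagonal structure of Eq. (2).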