AN IMPROVEMENT TO THE NATURAL GRADIENT LEARNING ALGORITHM FOR
MULTILAYER PERCEPTRONS
Michael R. Bastian, Jacob H. Gunther and Todd K. Moon
Utah State University
Department of Electrical and Computer Engineering
4120 Old Main Hill, Logan, UT 84322-4120
ABSTRACT
Natural gradient learning has been shown to avoid singularities in the parameter space of multilayer perceptrons. However, it requires a large number of additional parameters beyond ordinary backpropagation. This article describes a new approach to natural gradient learning in which the number of parameters necessary is much smaller than in the natural gradient algorithm. This new method exploits the algebraic structure of the parameter space to reduce the space and time complexity of the algorithm and to improve its performance.
1. INTRODUCTION
Amari and his colleagues have developed natural gradient learning for multilayer perceptrons [1, 2, 3]. Instead of the steepest-descent direction, it uses a Quasi-Newton method [4] that takes the Riemannian metric tensor of the underlying parameter space as the approximation to the Hessian. In the case of multilayer perceptrons, this metric tensor is the Fisher Information matrix evaluated at the current parameter. Since the Fisher Information matrix is the expected value of the Hessian of the negative log-likelihood, it fits very nicely into a Quasi-Newton optimization framework.
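As a sketch of this idea (not the authors' algorithm), the natural-gradient step preconditions the ordinary gradient with the inverse Fisher Information matrix. The linear-Gaussian toy model below is an assumption chosen only because its empirical Fisher matrix has the closed form F = XᵀX/n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model d = x . theta_star with no noise; for a Gaussian
# likelihood the empirical Fisher Information matrix is F = X^T X / n.
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
d = X @ theta_star

theta = rng.normal(size=3)               # current parameter
grad = X.T @ (X @ theta - d) / len(X)    # gradient of (1/2) mean squared error
F = X.T @ X / len(X)                     # empirical Fisher Information matrix

eta = 0.5
theta_sd = theta - eta * grad                       # steepest-descent step
theta_ng = theta - eta * np.linalg.solve(F, grad)   # natural-gradient step
```

In this idealized quadratic setting the natural-gradient step moves exactly the fraction eta of the way to the optimum regardless of the conditioning of X, which is what the Quasi-Newton interpretation predicts.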
However, natural gradient learning requires inverting the Fisher Information matrix. Moreover, for large networks the algorithm becomes computationally intractable because of the large number of additional parameters in the Fisher Information matrix that are required during training. This problem can be mitigated by a new formulation of the Fisher Information matrix for multilayer perceptrons.
This article describes this new formulation. By picking an inner product for the parameter space and inducing a norm, a Sobolev Gradient [5] may be formulated such that the Fisher Information matrix has much smaller dimensions than in the Adaptive Natural Gradient algorithm [2]. As a result, the learning algorithm performs better and is more robust to initial values and noise.

Thanks to Anteon Corporation for funding this research.
2. A NEW PARAMETERIZATION
Let the random variable $x$ be the feature map of a multilayer perceptron $f$. Let the function $f(x)$ be the output of the perceptron when given $x$ as the input. Let the random variable $d$ be the desired output of the perceptron and the random variable
$$ e = d - f(x) \qquad (1) $$
be the output error of the perceptron. Let $f^{*}$ be the optimal perceptron such that $E[\|e\|^{2}]$ is minimized.
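A minimal numerical illustration of these definitions follows; the two-layer architecture, the tanh nonlinearity, and the layer sizes are hypothetical choices, not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W1, W2):
    """Two-layer perceptron: output f(x) for input x."""
    return W2 @ np.tanh(W1 @ x)

W1 = rng.normal(size=(4, 3))   # first-layer connection weights
W2 = rng.normal(size=(2, 4))   # second-layer connection weights

x = rng.normal(size=3)         # input (feature) vector
d = rng.normal(size=2)         # desired output
e = d - mlp(x, W1, W2)         # output error, Eq. (1)
```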
2.1. Mapping to an Anti-Diagonal Block-Matrix
A multilayer perceptron with $L$ layers is parameterized by $L$ matrices $W_{1}, \ldots, W_{L}$; each matrix $W_{\ell}$ represents the connection weights of the $\ell$-th layer.

Let $\Phi$ be a mapping of the matrices into a block matrix such that $W_{1}, \ldots, W_{L}$ lie on the anti-diagonal of the block matrix.
$$ \Phi(W_{1}, \ldots, W_{L}) =
\begin{bmatrix}
0 & \cdots & 0 & W_{L} \\
\vdots & & W_{L-1} & 0 \\
0 & \iddots & & \vdots \\
W_{1} & 0 & \cdots & 0
\end{bmatrix} \qquad (2) $$
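The anti-diagonal embedding of Eq. (2) can be sketched as below; the function name `Phi` and the block sizes are illustrative assumptions:

```python
import numpy as np

def Phi(*Ws):
    """Place W_1, ..., W_L on the anti-diagonal of a block matrix.

    W_1 lands in the lower-left block and W_L in the upper-right,
    matching the anti-diagonal layout of Eq. (2).
    """
    L = len(Ws)
    row_sizes = [W.shape[0] for W in reversed(Ws)]   # W_L's rows on top
    col_sizes = [W.shape[1] for W in Ws]             # W_1's columns on the left
    M = np.zeros((sum(row_sizes), sum(col_sizes)))
    r_off = np.cumsum([0] + row_sizes)
    c_off = np.cumsum([0] + col_sizes)
    for i, W in enumerate(reversed(Ws)):             # i = 0 is W_L (top block row)
        j = L - 1 - i                                # its anti-diagonal block column
        M[r_off[i]:r_off[i + 1], c_off[j]:c_off[j + 1]] = W
    return M
```

All blocks off the anti-diagonal are zero, so the embedding stores exactly the original connection weights and nothing more.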
This structure is similar in form to the weighted adjacency matrix of an $(L+1)$-partite graph [6].
2.2. The Inner Product of Two Perceptrons
An inner product of two perceptrons $A = \Phi(W_{1}, \ldots, W_{L})$ and $B = \Phi(V_{1}, \ldots, V_{L})$ is
$$ \langle A, B \rangle = \operatorname{tr}(A^{T}B) = \sum_{\ell=1}^{L} \operatorname{tr}(W_{\ell}^{T} V_{\ell}). \qquad (3) $$
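Assuming the trace (Frobenius) inner product written above as Eq. (3), the full-matrix and blockwise computations agree, so the large embedded matrix never needs to be formed. The `embed` helper below is a hypothetical two-layer special case of the anti-diagonal mapping of Eq. (2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two perceptrons with the same architecture (layer sizes are arbitrary here).
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
V1, V2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

def embed(Wa, Wb):
    """Anti-diagonal block matrix of Eq. (2) for a two-layer perceptron."""
    M = np.zeros((Wb.shape[0] + Wa.shape[0], Wa.shape[1] + Wb.shape[1]))
    M[:Wb.shape[0], Wa.shape[1]:] = Wb    # W_2 in the upper-right block
    M[Wb.shape[0]:, :Wa.shape[1]] = Wa    # W_1 in the lower-left block
    return M

A, B = embed(W1, W2), embed(V1, V2)

# Frobenius inner product of the embeddings ...
full = np.trace(A.T @ B)
# ... equals the sum of per-layer traces, because the blocks never overlap.
blockwise = np.trace(W1.T @ V1) + np.trace(W2.T @ V2)
```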
0-7803-8874-7/05/$20.00 ©2005 IEEE ICASSP 2005