Natural Computing 1: 85–108, 2002.
© 2002 Kluwer Academic Publishers. Printed in the Netherlands.
Beyond second-order statistics for learning:
A pairwise interaction model for entropy estimation
DENIZ ERDOGMUS, JOSE C. PRINCIPE and KENNETH E. HILD II
Computational NeuroEngineering Laboratory, Electrical & Computer Engineering
Department, University of Florida, Gainesville, FL 32611, USA
Abstract. Second order statistics have formed the basis of learning and adaptation due to their
appeal and analytical simplicity. On the other hand, in many realistic engineering problems
requiring adaptive solutions, it is not sufficient to consider only the second order statistics of
the underlying distributions. Entropy, being the average information content of a distribution,
is a better-suited criterion for adaptation purposes, since it allows the designer to manipulate
the information content of the signals rather than merely their power. This paper introduces a
nonparametric estimator of Renyi’s entropy, which can be utilized in any adaptation scenario
where entropy plays a role. This nonparametric estimator leads to an interesting analogy
between learning and interacting particles in a potential field. It turns out that learning by
second order statistics is a special case of this interaction model for learning. We investi-
gate the mathematical properties of this nonparametric entropy estimator, provide batch and
stochastic gradient expressions for off-line and on-line adaptation, and illustrate the perform-
ance of the corresponding algorithms in examples of supervised and unsupervised training,
including time-series prediction and ICA.
Key words: adaptation, information theory, learning, Renyi’s entropy
1. Introduction
The mean square error (MSE) has been the workhorse of optimal data fitting
models since the early work of Gauss in the 19th century. Both optimal linear
filtering and pattern recognition formulations have extensively utilized the MSE
for very good reasons. In data fitting with the linear model, MSE yields a solu-
tion that is linear in the weights and can be analytically computed (the famous
least square method). Under the Gaussian assumption for the error, the MSE
provides the maximum likelihood solution, and so it has gained acceptance in
parameter estimation (Scharf 1990). The classical work of Wiener on optimal
filters in the MSE sense provided the theoretical framework (Wiener 1949),
and the stochastic gradient introduced by Widrow (Widrow and Stearns 1985),
which gave rise to the LMS algorithm, tremendously decreased the computational
complexity of adapting filters and, more importantly, opened new horizons for
adaptive systems.
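The two approaches contrasted above can be sketched in a few lines of code. The following is a minimal illustration (not from the paper; the data, step size, and epoch count are assumptions chosen for the toy problem) of the analytical least-squares solution that MSE admits for a linear model, next to the LMS stochastic-gradient update that approximates it iteratively:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear data-fitting problem: desired signal d = X @ w_true + noise
n, p = 200, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.01 * rng.standard_normal(n)

# Least squares: the MSE cost is quadratic in the weights, so the
# optimum is computed analytically from the normal equations.
w_ls = np.linalg.solve(X.T @ X, X.T @ d)

# LMS: stochastic gradient descent on the instantaneous squared error,
# updating the weights one sample at a time.
w = np.zeros(p)
mu = 0.01  # step size (assumed small enough for convergence)
for epoch in range(5):
    for k in range(n):
        e = d[k] - X[k] @ w    # instantaneous error
        w += mu * e * X[k]     # LMS update: w <- w + mu * e * x_k
```

With enough passes over the data, the LMS weights approach the analytical least-squares solution at a fraction of the per-step computational cost, which is what made the algorithm so attractive for adaptive filters.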