Natural Computing 1: 85–108, 2002.
© 2002 Kluwer Academic Publishers. Printed in the Netherlands.

Beyond second-order statistics for learning: A pairwise interaction model for entropy estimation

DENIZ ERDOGMUS, JOSE C. PRINCIPE and KENNETH E. HILD II
Computational NeuroEngineering Laboratory, Electrical & Computer Engineering Department, University of Florida, Gainesville, FL 32611, USA

Abstract. Second-order statistics have formed the basis of learning and adaptation due to their appeal and analytical simplicity. However, in many realistic engineering problems requiring adaptive solutions, it is not sufficient to consider only the second-order statistics of the underlying distributions. Entropy, being the average information content of a distribution, is a better-suited criterion for adaptation purposes, since it allows the designer to manipulate the information content of the signals rather than merely their power. This paper introduces a nonparametric estimator of Renyi's entropy, which can be utilized in any adaptation scenario where entropy plays a role. This nonparametric estimator leads to an interesting analogy between learning and interacting particles in a potential field; it turns out that learning by second-order statistics is a special case of this interaction model. We investigate the mathematical properties of this nonparametric entropy estimator, provide batch and stochastic gradient expressions for off-line and on-line adaptation, and illustrate the performance of the corresponding algorithms in examples of supervised and unsupervised training, including time-series prediction and ICA.

Key words: adaptation, information theory, learning, Renyi's entropy

1. Introduction

The mean square error (MSE) has been the workhorse of optimal data-fitting models since the early work of Gauss in the 19th century. Both optimal linear filtering and pattern recognition formulations have made extensive use of MSE, for very good reasons. In data fitting with the linear model, MSE yields a solution that is linear in the weights and can be computed analytically (the famous least squares method). Under the Gaussian assumption for the error, MSE provides the maximum likelihood solution, and so it has gained acceptance in parameter estimation (Scharf 1990). The classical work of Wiener on optimal filters in the MSE sense provided the theoretical framework (Wiener 1949), and the stochastic gradient introduced by Widrow (Widrow and Stearns 1985), which gave rise to the LMS algorithm, tremendously decreased the computational complexity of adapting filters and, more importantly, opened new horizons for adaptive systems.
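To make the contrast concrete, the following minimal sketch (illustrative, not from the paper) shows the two classical MSE approaches named above side by side: the closed-form least squares solution obtained from the normal equations, and the LMS stochastic gradient update of Widrow and Hoff. The synthetic data, the step size mu, and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: d = X @ w_true + Gaussian noise.
N, M = 500, 4
X = rng.standard_normal((N, M))
w_true = rng.standard_normal(M)
d = X @ w_true + 0.1 * rng.standard_normal(N)

# Batch solution: MSE is quadratic in the weights, so the minimizer is
# given in closed form by the normal equations (X^T X) w = X^T d.
w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)

# LMS: per-sample stochastic gradient descent on the squared error,
# w <- w + mu * e_k * x_k, which avoids forming or inverting X^T X.
mu = 0.01
w_lms = np.zeros(M)
for k in range(N):
    e_k = d[k] - X[k] @ w_lms   # instantaneous error for sample k
    w_lms += mu * e_k * X[k]    # Widrow-Hoff update

print("least squares:", w_ls)
print("LMS          :", w_lms)
```

Both estimates converge to the same neighborhood of w_true; the point of LMS, as noted above, is that it trades the one-shot matrix solution for cheap per-sample updates suitable for on-line adaptation.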