AN EFFICIENT RANK-DEFICIENT COMPUTATION OF THE PRINCIPLE OF RELEVANT INFORMATION

Luis Gonzalo Sánchez Giraldo, José C. Príncipe
University of Florida, Department of Electrical and Computer Engineering
Gainesville, FL 32611
{sanchez,principe}@cnel.ufl.edu

ABSTRACT

One of the main difficulties in computing information theoretic learning (ITL) estimators is the computational complexity, which grows quadratically with the number of data points. A considerable amount of work has been done on computing low-rank approximations of Gram matrices without accessing all of their elements. In this paper we discuss how these techniques can be applied to reduce the computational complexity of the Principle of Relevant Information (PRI). This particular objective function involves estimators of Rényi's second-order entropy and cross-entropy and their gradients, and therefore poses a technical challenge for implementation in realistic scenarios. Moreover, we introduce a simple modification to the Nyström method, motivated by the idea that our estimator must perform accurately only for certain vectors, not for all possible cases. We show some results on how these rank-deficient decompositions allow the application of the PRI to moderately large datasets.

Index Terms— Kernel methods, Information Theoretic Learning, Rank-deficient factorization, Nyström method.

1. INTRODUCTION

In recent years, kernel methods have received increasing attention within the machine learning community. They are theoretically elegant, algorithmically simple, and have shown considerable success in several practical problems. Information theoretic learning (ITL) is another emerging line of research with links to kernel-based estimation. However, ITL stems from a conceptually different framework [1]. For instance, the type of kernels employed in ITL need not be positive definite [2]. Despite this fundamental difference, applications of ITL often use either Gaussian or Laplacian kernels, which are positive definite.
An important feature of information theory is that it casts problems in principled operational quantities that have a direct interpretation. For example, some paradigms in unsupervised learning have emerged from information theory; the Infomax principle [3] and the preservation of mutual information across systems [4] fall within this category. Regularities in data can reveal the structure of the underlying generating process, and capturing this structure is therefore a problem of relevance determination. Low-entropy components of a random variable can be attributed to its generation process. Rao et al. [5] propose an ITL objective that attempts to capture the underlying structure through a random variable's PDF; this is called the principle of relevant information.

(This work is funded by ONR N00014-10-1-0375 and the UF ECE Department Latin American Fellowship Award.)

A major issue, which we address in this paper, is that the amount of computation associated with the PRI grows quadratically with the size of the available sample. This limits the scale of the applications if one were to apply the formulas directly. The problem of polynomial growth in complexity has also received attention within the part of the machine learning community working on kernel methods. Consequently, approaches have been proposed to compute approximations of positive semidefinite matrices based on kernels [6, 7]. The goal of these methods is to accurately estimate large Gram matrices without computing their n^2 elements directly. It has been observed that in practice the eigenvalues of the Gram matrix drop rapidly, and therefore replacing the original matrix by a low-rank approximation seems reasonable [7, 8]. In our work, we derive an algorithm for the principle of relevant information based on rank-deficient approximations of a Gram matrix. We also propose a simple modified version of the Nyström method particularly suited for estimation in ITL.
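To make the low-rank idea above concrete, the following sketch shows a standard Nyström approximation of a Gaussian Gram matrix built from a random subset of landmark points (this illustrates the generic technique from [6, 7], not the modified method proposed later in the paper; all function names are ours):

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    # Pairwise Gaussian kernel matrix between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom(X, m, sigma, seed=0):
    """Rank-m Nystrom factors of K: K is approximated by C @ Winv @ C.T,
    where C holds kernel evaluations against m random landmarks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = gaussian_gram(X, X[idx], sigma)   # n x m block of K
    W = C[idx, :]                         # m x m landmark kernel matrix
    # Pseudo-inverse guards against a near-singular landmark block.
    return C, np.linalg.pinv(W)

# Usage: approximate the n x n Gram matrix from n*m kernel evaluations.
X = np.random.default_rng(1).normal(size=(500, 2))
C, Winv = nystrom(X, m=50, sigma=1.0)
K_approx = C @ Winv @ C.T
```

Because the kernel eigenvalues decay quickly, a small number of landmarks m is usually enough, so the cost drops from O(n^2) kernel evaluations to O(nm).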
The paper starts with a brief introduction to Rényi's entropy and the associated information quantities, with their corresponding rank-deficient approximations. Then, the objective function for the principle of relevant information (PRI) is presented. Next, we propose an implementation of the optimization problem based on rank-deficient approximations. The algorithm is tested on simulated data for various accuracy regimes (different ranks), followed by some results on realistic scenarios. Finally, we provide some conclusions along with future work directions.

2. RANK-DEFICIENT APPROXIMATION FOR ITL

2.1. Rényi's α-Order Entropy and Related Functions

In information theory, a natural extension of the commonly used Shannon entropy is the α-order entropy proposed by Rényi [9]. For a random variable X with probability density function (PDF) $f(x)$ and support $\mathcal{X}$, the α-entropy $H_\alpha(X)$ is defined as

$$H_\alpha(f) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} f^\alpha(x)\,dx. \qquad (1)$$

The case $\alpha \to 1$ gives Shannon's entropy. Similarly, a modified version of Rényi's definition of the α-relative entropy between random variables with PDFs $f$ and $g$ is given in [10],

$$D_\alpha(f\|g) = \log\frac{\left(\int g^{\alpha-1} f\right)^{\frac{1}{1-\alpha}} \left(\int g^{\alpha}\right)^{\frac{1}{\alpha}}}{\left(\int f^{\alpha}\right)^{\frac{1}{\alpha(1-\alpha)}}}. \qquad (2)$$

Likewise, $\alpha \to 1$ yields Shannon's relative entropy (the KL divergence). An important component of the relative entropy is the cross-entropy

2176 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011
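For $\alpha = 2$, plugging a Gaussian Parzen density estimate into (1) gives the familiar sample estimators of Rényi's second-order entropy and cross-entropy used throughout ITL: minus the log of the mean pairwise kernel value (the "information potential"). A minimal sketch, assuming a Gaussian Parzen kernel of width sigma (so pairwise interactions use bandwidth sigma*sqrt(2); function names are ours):

```python
import numpy as np

def information_potential(X, Y, sigma):
    """Mean Gaussian kernel value over all pairs (x_i, y_j).
    This is the argument of the log in the second-order estimators."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    # Convolving two width-sigma Parzen kernels gives width sigma*sqrt(2).
    return np.exp(-d2 / (4.0 * sigma ** 2)).mean()

def renyi_quadratic_entropy(X, sigma):
    # H_2 estimate: -log of the information potential of X with itself.
    return -np.log(information_potential(X, X, sigma))

def renyi_cross_entropy(X, Y, sigma):
    # Second-order cross-entropy estimate between samples X and Y.
    return -np.log(information_potential(X, Y, sigma))
```

Note that evaluating these estimators directly costs O(n^2) kernel evaluations, which is exactly the bottleneck the rank-deficient factorizations of this section are meant to remove.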