AN EFFICIENT RANK-DEFICIENT COMPUTATION OF THE PRINCIPLE OF RELEVANT
INFORMATION
Luis Gonzalo Sánchez Giraldo, José C. Príncipe
University of Florida
Department of Electrical and Computer Engineering
Gainesville, FL 32611
{sanchez,principe}@cnel.ufl.edu
ABSTRACT
One of the main difficulties in computing information theoretic
learning (ITL) estimators is a computational complexity that
grows quadratically with the number of data points. A considerable
amount of work has been done on computing low-rank approximations
of Gram matrices without accessing all their elements. In this paper we
discuss how these techniques can be applied to reduce the computational
complexity of the Principle of Relevant Information (PRI). This
particular objective function involves estimators of Rényi's second-order
entropy and cross-entropy and their gradients, posing a technical
challenge for implementation in realistic scenarios. Moreover, we
introduce a simple modification to the Nyström method, motivated by
the idea that our estimator must perform accurately only for certain
vectors, not for all possible cases. We show some results on how these
rank-deficient decompositions allow the application of the PRI to
moderately large datasets.
Index Terms— Kernel methods, Information Theoretic Learning,
Rank-deficient factorization, Nyström method.
1. INTRODUCTION
In recent years, kernel methods have received increasing attention
within the machine learning community. They are theoretically el-
egant, algorithmically simple, and have shown considerable success
in several practical problems. Information theoretic learning (ITL) is
another emerging line of research with links to kernel based estima-
tion. However, ITL stems from a conceptually different framework
[1]. For instance, the type of kernels employed in ITL need not be
positive definite [2]. Despite this fundamental difference, applications
of ITL often use either Gaussian or Laplacian kernels, which are
positive definite.
An important feature of information theory is that it casts prob-
lems in principled operational quantities that have direct interpreta-
tion. For example, in unsupervised learning some of the paradigms
have emerged from Information Theory. The Infomax principle [3]
and the preservation of mutual information across systems [4] are
within this category. Regularities in data can reveal the structure
of the underlying generating process; capturing this structure is
therefore a problem of relevance determination. Low-entropy components
of a random variable can be attributed to its generation process.
Rao et al. [5] propose an ITL objective that attempts to capture the
underlying structure through a random variable's PDF; this is called
the principle of relevant information.
This work is funded by ONR N00014-10-1-0375 and UF ECE Depart-
ment Latin American Fellowship Award.
A major issue, which we address in this paper, is that the amount
of computation associated with the PRI grows quadratically with the
size of the available sample. This limits the scale of the applications
if one were to apply the formulas directly. The problem of polynomial
growth in complexity has also received attention within the
machine learning community working on kernel methods. Consequently,
approaches to approximate positive semidefinite matrices based on
kernels have been proposed [6, 7]. The goal of these methods is to
accurately estimate large Gram matrices without computing all their
n² elements directly. It has been observed that in practice the
eigenvalues of a Gram matrix drop rapidly, and therefore replacing
the original matrix by a low-rank approximation seems reasonable
[7, 8]. In our work, we derive an algorithm for the principle of
relevant information based on rank-deficient approximations of a
Gram matrix. We also propose a simple modified version of the
Nyström method particularly suited for estimation in ITL.
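As a rough illustration of such low-rank approximations (the standard Nyström method of [6, 7], not the modified version proposed later), a rank-m approximation G ≈ C W⁺ Cᵀ can be built from m landmark points, where C is the n×m cross-kernel block and W the m×m landmark block. A minimal numpy sketch under these assumptions (all function names are ours):

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gaussian-kernel Gram matrix between row-sample sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom(X, m, sigma, rng):
    """Standard Nystrom factors: G is approximated by C @ Winv @ C.T."""
    idx = rng.choice(len(X), size=m, replace=False)  # random landmark subsample
    C = gaussian_gram(X, X[idx], sigma)              # n x m cross-kernel block
    W = gaussian_gram(X[idx], X[idx], sigma)         # m x m landmark block
    return C, np.linalg.pinv(W)

# sanity check on synthetic data: compare against the full Gram matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
C, Winv = nystrom(X, 40, sigma=1.0, rng=rng)
G_full = gaussian_gram(X, X, 1.0)
G_approx = C @ Winv @ C.T
err = np.linalg.norm(G_full - G_approx) / np.linalg.norm(G_full)
```

Because the Gaussian kernel's eigenvalues decay quickly, a small number of landmarks typically yields a small relative error, while the factors cost only O(nm) storage instead of O(n²).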
The paper starts with a brief introduction to Rényi's entropy and
the associated information quantities, along with their corresponding
rank-deficient approximations. Then, the objective function of the
principle of relevant information (PRI) is presented. Next, we propose
an implementation of the optimization problem based on rank-deficient
approximations. The algorithm is tested on simulated data for various
accuracy regimes (different ranks), followed by some results on
realistic scenarios. Finally, we provide some conclusions along with
future work directions.
2. RANK DEFICIENT APPROXIMATION FOR ITL
2.1. Renyi’s α-Order Entropy and Related Functions
In information theory, a natural extension of the commonly used
Shannon entropy is the α-order entropy proposed by Rényi [9]. For
a random variable X with probability density function (PDF) f(x)
and support \mathcal{X}, the α-entropy H_α(X) is defined as

H_\alpha(f) = \frac{1}{1-\alpha} \log \int_{\mathcal{X}} f^{\alpha}(x)\, dx. \qquad (1)
The case α → 1 gives Shannon's entropy. Similarly, a modified
version of Rényi's definition of the α-relative entropy between random
variables with PDFs f and g is given in [10]:
D_\alpha(f \,\|\, g) = \log \frac{\left( \int_{\mathcal{X}} g^{\alpha-1}(x) f(x)\, dx \right)^{\frac{1}{1-\alpha}} \left( \int_{\mathcal{X}} g^{\alpha}(x)\, dx \right)^{\frac{1}{\alpha}}}{\left( \int_{\mathcal{X}} f^{\alpha}(x)\, dx \right)^{\frac{1}{\alpha(1-\alpha)}}}. \qquad (2)
Likewise, α → 1 yields Shannon's relative entropy (the KL divergence).
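For the α = 2 case used by the PRI, the Parzen-window plug-in estimators of the quadratic entropy and cross-entropy reduce to negative logarithms of means of Gram-matrix entries, which is precisely where the rank-deficient factorizations pay off. A minimal numpy sketch, folding the Parzen convolution into a single effective kernel width (function names are ours):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel evaluated between all pairs of rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def renyi_quadratic_entropy(X, sigma):
    """Plug-in estimate of H_2(f): negative log of the mean Gram entry,
    i.e. -log( (1/n^2) sum_ij kernel(x_i, x_j) )."""
    return -np.log(gaussian_kernel(X, X, sigma).mean())

def renyi_cross_entropy(X, Y, sigma):
    """Plug-in estimate of the quadratic cross-entropy between the
    samples X ~ f and Y ~ g: -log( (1/(nm)) sum_ij kernel(x_i, y_j) )."""
    return -np.log(gaussian_kernel(X, Y, sigma).mean())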
An important component of the relative entropy is the cross-entropy
2176 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011