Maximum Entropy Model Based Classification with Feature Selection
Ambedkar Dukkipati, Abhay Kumar Yadav and M. Narasimha Murty
Dept. of Computer Science and Automation
Indian Institute of Science
Bangalore 560012, India
ambedkar@csa.iisc.ernet.in
Abstract—In this paper, we propose a classification algorithm
based on the maximum entropy principle. This algorithm
finds the most appropriate class-conditional maximum entropy
distributions for classification. No prior knowledge about the form of the density function is assumed for estimating the class conditional densities, except that information is given in the form of expected values of features. The algorithm also incorporates a method to select relevant features for classification. The proposed algorithm is suitable for large data-sets and is demonstrated by simulation results on some real-world benchmark data-sets.
Keywords-Bayes; Jeffreys divergence; sample mean;
I. INTRODUCTION
The classification problem can be stated as follows: given a set of N training data points (x_i, y_i), i = 1, ..., N, x_i ∈ X, y_i ∈ Y, the goal is to find the underlying unknown mapping or decision function h : X → Y. The most commonly used classifiers are support vector machines (SVM) [1], K-nearest neighbors [2], the Bayes classifier [3], AdaBoost [4], decision trees, neural networks (multi-layer perceptron), and so on.
Classification algorithms can be broadly categorized into linear and nonlinear classifiers. The decision function or decision boundary of a linear classifier is a linear combination of the features. The most common examples of linear classifiers are support vector machines (SVM) [1], Fisher's linear discriminant [5], and the single-layer perceptron [6].
A classifier is said to be nonlinear if its decision surface is a nonlinear combination of the features; a quadratic classifier is one example. A quadratic classifier can have the decision function

f(x) = x^T A x + b^T x + c,

where A ∈ R^{d×d}, b ∈ R^d, and c ∈ R, such that if f(x) > 0 the class label assigned is c_1, and otherwise c_2. The Bayes classifier [3] is a quadratic classifier when the multivariate normal distribution is used as the probabilistic model for both classes.
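As an illustration, the quadratic decision rule above can be sketched in a few lines of Python. The parameters A, b, and c here are arbitrary values chosen for the example, not taken from the paper:

```python
def quadratic_classify(x, A, b, c):
    """Assign class 1 if f(x) = x^T A x + b^T x + c > 0, else class 2."""
    d = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))
    lin = sum(b[i] * x[i] for i in range(d))
    f = quad + lin + c
    return 1 if f > 0 else 2

# Example with d = 2 and hand-picked (hypothetical) parameters.
A = [[1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0]
c = -0.5
print(quadratic_classify([2.0, 1.0], A, b, c))  # f = 4 - 1 - 0.5 = 2.5 > 0, so class 1
```

Since A here is indefinite, the decision boundary x^T A x + c = 0 is a hyperbola rather than a line, which is exactly what distinguishes this rule from a linear classifier.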
The Bayes classifier is specified in terms of the class conditional densities. Once the class conditional densities are estimated (parametrically or non-parametrically), the Bayes classifier assigns a class label to a test pattern/observation as follows. Let P(c_1) and P(c_2) be the prior probabilities of the two classes, and let P(x|c_1) and P(x|c_2) be the class conditional densities for classes c_1 and c_2, respectively. Then the Bayes classifier assigns the test pattern to class c_1 if

P(x|c_1) P(c_1) > P(x|c_2) P(c_2),   (1)

and otherwise to class c_2. In this setting, the two-class classification problem can easily be extended to the multiclass classification problem.
Estimation of class conditional densities mainly involves parametric and non-parametric approaches. In a parametric approach, some form of the class conditional densities is assumed a priori. The most commonly used model is the Gaussian model, but data may not always fit the Gaussian model well. In this respect, the maximum entropy principle offers more general and flexible models.
The maximum entropy (ME) principle has been used to learn statistical models in many applications, such as natural language processing [7] and texture modeling [8]. In this paper, we propose a method based on the ME principle to estimate the class conditional densities.
The paper is organized as follows. Section II gives the
information theory background. We present our proposed
method in Section III. In Section IV, we present simulation
results on some real and artificially generated datasets.
II. MAXIMUM ENTROPY FUNDAMENTALS
The maximum entropy principle (Jaynes, 1957) states that we should make use of all the information that is given and scrupulously avoid making assumptions about information that is not available. Let X be a random vector, i.e., X = (x_1, ..., x_d), x_i ∈ χ_i, i = 1, ..., d. The aim is to estimate the distribution using the information given in the form of expected values of some moment functions C = {φ_1(x), ..., φ_m(x)}. According to the ME principle, out of all distributions consistent with the given constraints, we should choose the distribution that maximizes the Shannon entropy. That is, we maximize

H = − ∫ P(x) ln P(x) dx.   (2)
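As a concrete instance (a classic textbook example, not from the paper): find the maximum-entropy distribution on the die faces {1, ..., 6} whose mean is constrained to 4.5. Maximizing (2) subject to a mean constraint yields an exponential-family solution p_i ∝ exp(λ x_i); the sketch below finds the multiplier λ numerically by bisection, since the mean is monotone in λ:

```python
import math

def maxent_with_mean(values, target_mean, iters=100):
    """Maximum-entropy distribution on a finite support with a fixed mean.
    The maximizer of the entropy under a mean constraint has the form
    p_i proportional to exp(lam * x_i); bisect on lam until the implied
    mean matches target_mean."""
    def mean_for(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z

    lo, hi = -50.0, 50.0  # mean_for is increasing in lam, so bisection applies
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * v) for v in values]
    z = sum(w)
    return [wi / z for wi in w]

# Die faces with mean 4.5 (a uniform die would give 3.5, so the
# constraint tilts probability mass toward the larger faces).
p = maxent_with_mean(list(range(1, 7)), 4.5)
```

The resulting probabilities increase geometrically with the face value, which is the signature of the exponential form dictated by the constraint.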
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.143
569