Maximum Entropy Model Based Classification with Feature Selection
Ambedkar Dukkipati, Abhay Kumar Yadav and M. Narasimha Murty
Dept. of Computer Science and Automation
Indian Institute of Science
Bangalore 560012, India
ambedkar@csa.iisc.ernet.in
Abstract—In this paper, we propose a classification algorithm
based on the maximum entropy principle. This algorithm
finds the most appropriate class-conditional maximum entropy
distributions for classification. No prior knowledge about the form of the density function is assumed for estimating the class conditional densities, except that information is given in the form of expected values of features. The algorithm also incorporates a method to select relevant features for classification. The proposed algorithm is suitable for large data-sets and is demonstrated by simulation results on some real-world benchmark data-sets.
Keywords-Bayes; Jeffreys divergence; sample mean;
I. INTRODUCTION
The classification problem can be stated as follows: given a set of N training data points (x_i, y_i), i = 1, ..., N, x_i ∈ X, y_i ∈ Y, the goal is to find the underlying unknown mapping or decision function h : X → Y. The most commonly used classifiers are support vector machines (SVM) [1], K-nearest neighbors [2], the Bayes classifier [3], AdaBoost [4], decision trees, neural networks (multi-layer perceptron), and so on.
Classification algorithms can be broadly categorized into linear and nonlinear classifiers. The decision function or decision boundary of a linear classifier is a linear combination of the features. The most common examples of linear classifiers are support vector machines (SVM) [1], Fisher's linear discriminant [5], and the single-layer perceptron [6].
A classifier is said to be nonlinear if its decision surface is a nonlinear combination of the features; a quadratic classifier is one example. A quadratic classifier can have the decision function

f(x) = x^T A x + b^T x + c,

where A ∈ R^{d×d}, b ∈ R^d, and c ∈ R, such that if f(x) > 0 the class label assigned is c_1, and otherwise c_2. The Bayes classifier [3] is a quadratic classifier when the multivariate normal distribution is used as the probabilistic model for both classes.
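As an illustration, the quadratic decision rule above can be sketched in a few lines of Python. The parameters A, b, and c here are arbitrary values chosen for the example, not taken from the paper:

```python
def quadratic_classify(x, A, b, c):
    """Assign class 1 if f(x) = x^T A x + b^T x + c > 0, else class 2."""
    d = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))
    lin = sum(b[i] * x[i] for i in range(d))
    f = quad + lin + c
    return 1 if f > 0 else 2

# Example with d = 2 and hand-picked (hypothetical) parameters.
A = [[1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0]
c = -0.5
print(quadratic_classify([2.0, 1.0], A, b, c))  # f = 4 - 1 - 0.5 = 2.5 > 0, so class 1
```

Since A here is indefinite, the decision boundary x^T A x + c = 0 is a hyperbola rather than a line, which is exactly what distinguishes this rule from a linear classifier.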
The Bayes classifier is specified in terms of the class conditional densities. Once the class conditional densities are estimated (parametrically or non-parametrically), the Bayes classifier assigns a class label to a test pattern/observation as follows. Let P(c_1) and P(c_2) be the prior probabilities of the two classes, and let P(x|c_1) and P(x|c_2) be the class conditional densities for classes c_1 and c_2, respectively. Then the Bayes classifier assigns the test pattern to class c_1 if

P(x|c_1) P(c_1) > P(x|c_2) P(c_2),   (1)

and otherwise to class c_2. In this setting, the two-class classification problem can easily be extended to the multiclass classification problem.
Estimation of class conditional densities mainly involves parametric and non-parametric approaches. In a parametric approach, some form of the class conditional densities is assumed a priori. The most commonly used model is the Gaussian model, but data may not always fit the Gaussian model well. In this respect, the maximum entropy principle offers more general and flexible models.
The maximum entropy (ME) principle has been used to learn statistical models in many applications, such as natural language processing [7] and texture modeling [8]. In this paper, we propose a method based on the ME principle to estimate the class conditional densities.
The paper is organized as follows. Section II gives the
information theory background. We present our proposed
method in Section III. In Section IV, we present simulation
results on some real and artificially generated datasets.
II. MAXIMUM ENTROPY FUNDAMENTALS
The maximum entropy principle (Jaynes, 1957) states that we should make use of all the information that is given and scrupulously avoid making assumptions about information that is not available. Let X be a random vector, i.e., X = (x_1, ..., x_d), x_i ∈ χ_i, i = 1, ..., d. The aim is to estimate the distribution using the information given in the form of expected values of some moment functions C = {φ_1(x), ..., φ_m(x)}. According to the ME principle, out of all distributions consistent with the given constraints, we should choose the distribution that maximizes the Shannon entropy. That is, we maximize

H = − ∫ P(x) ln P(x) dx.   (2)
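As a concrete instance (a classic textbook example, not from the paper): find the maximum-entropy distribution on the die faces {1, ..., 6} whose mean is constrained to 4.5. Maximizing (2) subject to a mean constraint yields an exponential-family solution p_i ∝ exp(λ x_i); the sketch below finds the multiplier λ numerically by bisection, since the mean is monotone in λ:

```python
import math

def maxent_with_mean(values, target_mean, iters=100):
    """Maximum-entropy distribution on a finite support with a fixed mean.
    The maximizer of the entropy under a mean constraint has the form
    p_i proportional to exp(lam * x_i); bisect on lam until the implied
    mean matches target_mean."""
    def mean_for(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z

    lo, hi = -50.0, 50.0  # mean_for is increasing in lam, so bisection applies
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * v) for v in values]
    z = sum(w)
    return [wi / z for wi in w]

# Die faces with mean 4.5 (a uniform die would give 3.5, so the
# constraint tilts probability mass toward the larger faces).
p = maxent_with_mean(list(range(1, 7)), 4.5)
```

The resulting probabilities increase geometrically with the face value, which is the signature of the exponential form dictated by the constraint.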
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.143
569