Maximum Entropy Model Based Classification with Feature Selection

Ambedkar Dukkipati, Abhay Kumar Yadav and M. Narasimha Murty
Dept. of Computer Science and Automation
Indian Institute of Science
Bangalore 560012, India
ambedkar@csa.iisc.ernet.in

Abstract—In this paper, we propose a classification algorithm based on the maximum entropy principle. The algorithm finds the most appropriate class-conditional maximum entropy distributions for classification. No prior knowledge about the form of the density function is assumed for estimating the class-conditional density, except that information is given in the form of expected values of features. The algorithm also incorporates a method to select features relevant for classification. The proposed algorithm is suitable for large data-sets, as demonstrated by simulation results on some real-world benchmark data-sets.

Keywords-Bayes; Jeffreys divergence; sample mean;

I. INTRODUCTION

The classification problem can be stated as follows: given a set of N training data points (x_i, y_i), i = 1, ..., N, x_i ∈ X, y_i ∈ Y, the goal is to find the underlying unknown mapping or decision function h : X → Y. The most commonly used classifiers are support vector machines (SVM) [1], K-nearest neighbors [2], the Bayes classifier [3], AdaBoost [4], decision trees, neural networks (multi-layer perceptrons), and so on.

Classification algorithms can be broadly divided into linear and nonlinear classifiers. The decision function or decision boundary of a linear classifier is a linear combination of the features. The most common examples of linear classifiers are support vector machines (SVM) [1], Fisher's linear discriminant [5] and the single-layer perceptron [6]. A classifier is said to be a nonlinear classifier if its decision surface is a nonlinear combination of the features; the quadratic classifier is one example.
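As a minimal illustration of the linear case above, the following sketch evaluates the sign of a linear decision function w^T x + b to assign one of two class labels. The weights w and bias b here are hypothetical values chosen for the example, not taken from the paper.

```python
import numpy as np

def linear_decision(x, w, b):
    """Linear classifier: the sign of w^T x + b decides the class label."""
    return 1 if np.dot(w, x) + b > 0 else 2

# Hypothetical 2-D example: separate points by the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = -1.0
print(linear_decision(np.array([2.0, 2.0]), w, b))  # class 1
print(linear_decision(np.array([0.0, 0.0]), w, b))  # class 2
```

A nonlinear classifier differs only in that the decision function is a nonlinear combination of the features, as in the quadratic classifier described next.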
A quadratic classifier can be described by a decision function f(x) = x^T A x + b^T x + c, where A ∈ R^{d×d}, b ∈ R^d and c ∈ R, such that if f(x) > 0 the class label assigned is c_1, otherwise c_2. The Bayes classifier [3] is a quadratic classifier when the multivariate normal distribution is used as the probabilistic model for both classes.

The Bayes classifier is specified in terms of the class-conditional densities. Once the class-conditional densities are estimated (parametrically or non-parametrically), the Bayes classifier assigns a class label to a test pattern/observation as follows. Let P(c_1) and P(c_2) be the prior probabilities of the two classes, and let P(x|c_1) and P(x|c_2) be the class-conditional densities for classes c_1 and c_2, respectively. Then the Bayes classifier assigns the test pattern to class c_1 if

    P(x|c_1) P(c_1) > P(x|c_2) P(c_2),        (1)

and otherwise to class c_2. The two-class classification problem can easily be extended to the multiclass case.

Estimation of the class-conditional density mainly follows parametric or non-parametric approaches. In a parametric approach, some form of the class-conditional densities is assumed a priori; the most commonly used model is the Gaussian. But data may not always fit the Gaussian model well. In this respect, the maximum entropy principle offers more general and flexible models. The maximum entropy (ME) principle has been used to learn statistical models in many applications such as natural language processing [7] and texture modeling [8]. In this paper, we propose a method based on the ME principle to estimate the class-conditional densities.

The paper is organized as follows. Section II gives the information theory background. We present our proposed method in Section III. In Section IV, we present simulation results on some real and artificially generated datasets.

II.
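The decision rule (1) can be sketched directly for the Gaussian case mentioned above. The means, covariances and priors below are hypothetical values for illustration; with Gaussian class-conditional densities this rule yields a quadratic decision boundary.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def bayes_classify(x, mean1, cov1, p1, mean2, cov2, p2):
    """Assign class 1 if P(x|c1)P(c1) > P(x|c2)P(c2), else class 2 (Eq. 1)."""
    if gaussian_pdf(x, mean1, cov1) * p1 > gaussian_pdf(x, mean2, cov2) * p2:
        return 1
    return 2

# Hypothetical 2-D classes with identity covariances and equal priors.
m1, m2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
I = np.eye(2)
print(bayes_classify(np.array([0.5, 0.5]), m1, I, 0.5, m2, I, 0.5))  # class 1
```

With equal priors and identity covariances this reduces to nearest-mean assignment; unequal covariances make the boundary genuinely quadratic.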
MAXIMUM ENTROPY FUNDAMENTALS

The maximum entropy principle (Jaynes, 1957) states that we should make use of all the information that is given and scrupulously avoid making assumptions about information that is not available. Let X be a random vector, i.e., X = (x_1, ..., x_d), x_i ∈ χ_i, i = 1, ..., d. The aim is to estimate the distribution using information given in the form of expected values of some moment functions C = {φ_1(x), ..., φ_m(x)}. According to the ME principle, out of all distributions consistent with the given constraints, we should choose the one that maximizes the Shannon entropy. That is, we maximize

    H = - ∫ P(x) ln P(x) dx        (2)

2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.143 569
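A small worked instance of maximizing (2) subject to a moment constraint is Jaynes' classic die example: find the maximum entropy distribution on {1, ..., 6} whose mean is fixed (here 4.5). The ME solution has the exponential-family form p_k ∝ exp(λ k), and the sketch below finds λ by bisection. This is an illustrative exercise, not the paper's algorithm.

```python
import numpy as np

def maxent_dice(target_mean, lo=-10.0, hi=10.0, iters=100):
    """Maximum entropy distribution on {1,...,6} with a fixed mean.

    The ME solution has the form p_k ∝ exp(lam * k); since the mean
    is monotone increasing in lam, bisection finds the lam that
    satisfies the moment constraint.
    """
    k = np.arange(1, 7)

    def mean_for(lam):
        w = np.exp(lam * k)
        return (k * w).sum() / w.sum()

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    w = np.exp(0.5 * (lo + hi) * k)
    return w / w.sum()

p = maxent_dice(4.5)
print(np.round(p, 3))  # probabilities increase toward the face 6
```

For a target mean of 3.5 the constraint is uninformative and the solution is uniform; shifting the mean tilts the distribution exponentially, which is exactly the behavior the ME principle prescribes.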