Appl Intell (2012) 37:100–120
DOI 10.1007/s10489-011-0315-y

A novel feature selection method based on normalized mutual information

La The Vinh · Sungyoung Lee · Young-Tack Park · Brian J. d'Auriol

L.T. Vinh · S. Lee · B.J. d'Auriol
Dept. of Computer Engineering, Kyung Hee University, Seoul, Korea
e-mail: sylee@oslab.khu.ac.kr

Y.-T. Park
School of IT, Soongsil University, Seoul, Korea
e-mail: park@ssu.ac.kr

Published online: 23 August 2011
© Springer Science+Business Media, LLC 2011

Abstract  In this paper, a novel feature selection method based on the normalization of the well-known mutual information measurement is presented. Our method is derived from an existing approach, the max-relevance and min-redundancy (mRMR) approach. We, however, propose to normalize the mutual information used in the method so that the domination of the relevance or of the redundancy can be eliminated. We borrow some commonly used recognition models, including Support Vector Machine (SVM), k-Nearest-Neighbor (kNN), and Linear Discriminant Analysis (LDA), to compare our algorithm with the original mRMR and a recently improved version of the mRMR, the Normalized Mutual Information Feature Selection (NMIFS) algorithm. To avoid data-specific statements, we conduct our classification experiments using various datasets from the UCI machine learning repository. The results confirm that our feature selection method is more robust than the others with regard to classification accuracy.

Keywords  Feature selection · Mutual information · Minimal redundancy · Maximal relevance

1 Introduction

Feature selection is a technique for selecting, from a large number of features extracted from the input data, a subset of relevant features that contain information helping to distinguish one class from the others. Feature selection is different from feature extraction [11], wherein a new set of features is formed by projecting the original feature space into a reduced-dimension space.
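To make the distinction concrete, the following is a minimal sketch (in Python, on hypothetical toy data) of filter-style feature selection: each original feature is scored by its mutual information with the class label, and the highest-scoring columns are kept unchanged, rather than projected into a new space as in feature extraction. The helper below uses a plain empirical plug-in estimate for discrete variables; it illustrates only generic MI-based ranking, not the normalized criterion proposed in this paper.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical plug-in estimate of I(X;Y) in bits for discrete arrays x, y."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability
            px, py = np.mean(x == xv), np.mean(y == yv)  # marginals
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# Toy data: column 0 copies the class label, column 1 is independent noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
X = np.column_stack([labels, rng.integers(0, 2, size=200)])

# Feature *selection*: rank the original columns and keep the best one as-is.
scores = [mutual_information(X[:, j], labels) for j in range(X.shape[1])]
best = int(np.argmax(scores))  # the informative column wins
```

Here the informative column scores close to H(labels) (about one bit), while the noise column scores near zero, so the ranking recovers the relevant feature without transforming the data.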
In the present paper, we focus only on feature selection methods.

In pattern recognition, the identification of the most discriminative features is an important step [7], since it is common to have a large number of features, including relevant as well as irrelevant ones, at the beginning of the pattern recognition process [11, 15]. Feeding a large set of features into a recognition model not only increases the computational burden but also causes the problem commonly known as the curse of dimensionality. Therefore, removing irrelevant features helps speed up the learning process and alleviates the effect of the curse of dimensionality. Owing to these capabilities, feature selection has been widely applied in many areas, including text classification [6, 12], bio-informatics [8, 24, 32], intrusion detection [18, 27], and image retrieval [5, 9]. Furthermore, feature selection facilitates data visualization and understanding [14, 17, 31].

So far, a great number of methods have been proposed in the feature selection research area. These methods can be categorized into three main directions, namely wrapper, embedded, and filter. Wrapper approaches [25, 29] make use of the classification accuracy to evaluate the usefulness of features at each step. However, repeatedly training such classifiers often incurs a high computational cost, making wrapper-based methods impractical for large datasets. Besides, the performance of a wrapper approach may strictly depend on the classifier used in the evaluation. Embedded methods [4, 33] also use particular classifiers to find feature subsets. They, however, select features during the training phase of the classifier. Thus, embedded methods can utilize extra information of the cost function to guide the