Appl Intell (2012) 37:100–120
DOI 10.1007/s10489-011-0315-y
A novel feature selection method based on normalized mutual information
La The Vinh · Sungyoung Lee · Young-Tack Park ·
Brian J. d’Auriol
Published online: 23 August 2011
© Springer Science+Business Media, LLC 2011
Abstract In this paper, a novel feature selection method
based on the normalization of the well-known mutual
information measure is presented. Our method is derived
from an existing approach, the max-relevance and min-
redundancy (mRMR) approach. We propose, however, to
normalize the mutual information used in the method so
that domination by either the relevance term or the
redundancy term is eliminated. We employ several
commonly used recognition models, including Support
Vector Machine (SVM), k-Nearest-Neighbor (kNN), and
Linear Discriminant Analysis (LDA), to compare our
algorithm with the original mRMR and with a recently
improved version of it, the Normalized Mutual Information
Feature Selection (NMIFS) algorithm. To avoid
data-specific conclusions, we conduct our classification
experiments on various datasets from the UCI machine
learning repository. The results confirm that our feature
selection method is more robust than the others with
regard to classification accuracy.
Keywords Feature selection · Mutual information ·
Minimal redundancy · Maximal relevance
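For reference, the quantities the abstract mentions can be sketched numerically. The code below estimates the mutual information between two discrete variables from their empirical joint distribution and then scales it into [0, 1]. The normalization by min(H(X), H(Y)) follows NMIFS and is shown only as one common variant, not necessarily the normalization proposed in this paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete variables,
    estimated from the empirical joint distribution."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)          # empirical joint counts
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(joint.ravel())

def normalized_mi(x, y):
    """Mutual information scaled into [0, 1]; normalized here by
    min(H(X), H(Y)) as in NMIFS (illustrative choice only)."""
    h = min(entropy(np.bincount(x) / len(x)),
            entropy(np.bincount(y) / len(y)))
    return mutual_information(x, y) / h if h > 0 else 0.0

x = np.array([0, 0, 1, 1])
print(normalized_mi(x, x))                        # identical -> 1.0
print(normalized_mi(x, np.array([0, 1, 0, 1])))   # independent -> 0.0
```

Normalizing keeps features with large entropies from dominating the relevance or redundancy terms, which is the imbalance the paper sets out to remove.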
L.T. Vinh · S. Lee (✉) · B.J. d’Auriol
Dept. of Computer Engineering, Kyung Hee University, Seoul, Korea
e-mail: sylee@oslab.khu.ac.kr

Y.-T. Park
School of IT, Soongsil University, Seoul, Korea
e-mail: park@ssu.ac.kr

1 Introduction

Feature selection is a technique for selecting a subset of
relevant features, which contain information that helps
distinguish one class from the others, from a large number of features
extracted from the input data. Feature selection is differ-
ent from feature extraction [11], wherein a new set of fea-
tures is formed by projecting the original feature space into
a reduced-dimension space. In the present paper, we focus
only on feature selection methods.
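The distinction can be shown with a toy example: selection keeps a subset of the original columns, whereas extraction forms new features that mix all dimensions. The random linear projection below merely stands in for a real extraction method such as PCA or LDA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, 5 hypothetical features

# Feature selection: keep a subset of the ORIGINAL columns
# (indices chosen arbitrarily for illustration).
selected = X[:, [1, 3]]

# Feature extraction: project into a NEW, lower-dimensional space;
# each new feature is a combination of all original features.
W = rng.normal(size=(5, 2))     # stand-in for a learned projection
extracted = X @ W

print(selected.shape, extracted.shape)   # both (100, 2)
```

Both results have the same reduced shape, but only `selected` retains the original, interpretable feature values — the property that makes feature selection attractive.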
In pattern recognition, the identification of the most dis-
criminative features is an important step [7], since it is com-
mon to have a large number of features, including relevant
as well as irrelevant features, at the beginning of the pattern
recognition process [11, 15]. Feeding a large set of features
into a recognition model not only increases the computational
burden but also causes the problem commonly known as the
curse of dimensionality. Therefore, removing irrelevant fea-
tures helps speed up the learning process and alleviates the
effect of the curse of dimensionality. Owing to these capabilities,
feature selection has been widely applied in many applications,
including text classification [6, 12], bio-informatics
[8, 24, 32], intrusion detection [18, 27], and image retrieval
[5, 9]. Furthermore, feature selection facilitates data
visualization and understanding [14, 17, 31].
So far, a great number of methods have been proposed in the
feature selection literature. These methods can be categorized
into three main directions, namely wrapper, embedded, and
filter. Wrapper approaches [25, 29] make use of the
classification accuracy to evaluate the usefulness of features at
each step. However, repeatedly training such classifiers often
incurs a high computational cost, making wrapper-based
methods impractical for large datasets. Moreover, the
performance of a wrapper approach may depend strongly on
the classifier used in the evaluation.
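The wrapper idea can be sketched as a greedy forward search driven by a pluggable accuracy function. The toy additive score below stands in for actual classifier training — the per-candidate retraining in the inner loop is exactly what makes real wrappers expensive.

```python
def wrapper_forward_selection(features, score_fn, k):
    """Greedy forward selection: at each step, add the feature whose
    inclusion yields the best score (e.g. classifier accuracy)."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # One evaluation (i.e. one classifier training) per candidate:
        # this inner loop is the costly part of wrapper methods.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in for accuracy: each feature contributes a fixed amount.
toy_scores = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.4}
score = lambda subset: sum(toy_scores[f] for f in subset)
print(wrapper_forward_selection(["a", "b", "c", "d"], score, 2))
# -> ['a', 'c']
```

Because `score_fn` is the only contact point with the classifier, the same skeleton also illustrates the dependence noted above: swapping in a different classifier's accuracy can change which features are selected.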
Embedded methods [4, 33] also use particular classifiers
to find feature subsets. They, however, select features in the
training phase of the classifier. Thus, embedded methods can
utilize extra information from the cost function to guide the