International Journal of Computer Applications (0975 – 8887) Volume 51 – No. 12, August 2012

MFSPFA: An Enhanced Filter based Feature Selection Algorithm

V. Arul Kumar
Research Scholar
Dept. of Computer Science
St. Joseph’s College (Autonomous)
Trichy, TN, India

L. Arockiam
Associate Professor
Dept. of Computer Science
St. Joseph’s College (Autonomous)
Trichy, TN, India

ABSTRACT
Feature selection is the process of selecting a significant feature subset from the original set of features. It is frequently used as a preprocessing technique in data mining. In this study, a new feature selection algorithm, called Modified Fisher Score Principal Feature Analysis (MFSPFA), is proposed. The algorithm is developed by combining the proposed Modified Fisher Score (MFS) with Principal Feature Analysis (PFA). The proposed algorithm is tested on publicly available datasets. The experimental results show that the proposed algorithm reduces the number of useless features and improves classification accuracy.

General Terms
Data Mining, Classification, Filter Approach, Feature Selection Algorithms

Keywords
Feature Selection, Modified Fisher Score, Principal Component Analysis, Principal Feature Analysis

1. INTRODUCTION
Feature selection is the process of selecting a subset of relevant features by removing redundant, irrelevant and noisy data from the original dataset. Feature selection methods fall into two categories: the filter approach and the wrapper approach. In the filter approach, features are selected based on criteria that are independent of the particular learning algorithm to be applied to the data. In the wrapper approach, candidate feature subsets are evaluated with the learning algorithm itself [1]. Feature selection algorithms are categorized into supervised algorithms [2, 3], unsupervised algorithms [4, 5] and semi-supervised algorithms [6, 7]. In supervised learning, all instances are associated with class labels.
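The filter/wrapper distinction can be sketched in code. The following is an illustrative example (not from the paper): the variance criterion used by `filter_select` and the leave-one-out 1-NN evaluator used by `wrapper_select` are assumptions chosen only to make the contrast concrete, not the criteria studied in this work.

```python
import numpy as np
from itertools import combinations

def filter_select(X, top_k):
    """Filter approach: rank features by a criterion that is
    independent of any learning algorithm (here, plain variance,
    chosen only for illustration)."""
    scores = X.var(axis=0)
    return np.argsort(scores)[::-1][:top_k]

def loo_1nn_accuracy(X, y):
    """Leave-one-out 1-nearest-neighbour accuracy, standing in for
    'the learning algorithm' that a wrapper would use."""
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf              # exclude the point itself
        correct += y[np.argmin(d)] == y[i]
    return correct / len(y)

def wrapper_select(X, y, top_k, evaluate):
    """Wrapper approach: evaluate candidate feature subsets with the
    learning algorithm itself and keep the best-scoring subset."""
    best_subset, best_score = None, -np.inf
    for subset in combinations(range(X.shape[1]), top_k):
        score = evaluate(X[:, subset], y)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset

# Toy data: feature 0 separates the two classes cleanly, while
# feature 1 is high-variance noise that fools the variance filter.
X = np.array([[0.0,  0.0], [0.1, 20.0], [0.2,  5.0],
              [10.0, 18.0], [10.1, 2.0], [10.2, 11.0]])
y = np.array([0, 0, 0, 1, 1, 1])
```

On this toy data, `filter_select(X, 1)` picks the noisy feature 1 (it has the larger variance), while `wrapper_select(X, y, 1, loo_1nn_accuracy)` picks the discriminative feature 0, which illustrates how the two approaches can disagree.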
In unsupervised learning, no class labels are available for the instances. In semi-supervised learning, a few instances have class labels and the remaining instances do not [8]. The selection criterion is a key component of feature selection, since it determines which features are judged best. Over the years, various selection criteria have been proposed for filter-based feature selection, including Mutual Information [9], ReliefF [10], Laplacian Score [11], Fisher Score [12], SPEC [13], the Hilbert–Schmidt Independence Criterion (HSIC) [14] and Trace Ratio [15]. A feature selection technique retains the most relevant features: supervised methods remove noisy and irrelevant features, while unsupervised methods remove redundant features by computing a similarity or correlation measure between features. Recent studies combine supervised and unsupervised methods to find the best feature set from the original set [1]. However, obtaining the best feature subset while maintaining classification accuracy remains an open problem. In this paper, a new feature selection algorithm is proposed by combining MFS and PFA. MFS is a supervised method that removes noisy and irrelevant features, which carry less discriminant information. PFA is an unsupervised method that selects relevant features using a correlation or similarity measure and also removes redundant features. The proposed algorithm is validated on publicly available datasets. The results show that the proposed algorithm can largely reduce the feature dimensionality while also improving classification accuracy.
The rest of the paper is organized as follows: Section 2 reviews existing feature selection criteria, Section 3 describes Principal Component Analysis (PCA), Section 4 describes Principal Feature Analysis (PFA), Section 5 presents the proposed algorithm, Section 6 reports the experimental results, and Section 7 gives the concluding remarks.

2. EXISTING FEATURE SELECTION CRITERIA USED FOR FINDING FEATURE SUBSET
In this section, existing feature selection criteria are discussed. These include algorithms such as ReliefF [10], Laplacian Score [11], Fisher Score [12], SPEC [13], the Hilbert–Schmidt Independence Criterion (HSIC) [14] and Trace Ratio [15].

2.1 Feature selection using Fisher Score
The Fisher score is one of the simplest filter criteria for feature selection [12]. Under this criterion, the selected features are those whose values are similar within the same class and dissimilar across different classes. The Fisher score of the i-th feature is calculated using the formula

F(f_i) = \frac{\sum_{k=1}^{c} s_k (\mu_{i,k} - \mu_i)^2}{\sum_{k=1}^{c} s_k \sigma_{i,k}^2}

where μ_i is the mean of the i-th feature, s_k is the number of samples in the k-th class, and μ_{i,k} is the mean of the i-th feature in the k-th class
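The Fisher score criterion above can be sketched as a short NumPy function. This is an illustrative implementation of the standard Fisher score, not the authors' code; it follows the formula's notation, with s_k as the class size and the per-class variance σ²_{i,k} in the denominator.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: between-class scatter of the class
    means over the (weighted) within-class variance."""
    classes = np.unique(y)
    n_features = X.shape[1]
    scores = np.zeros(n_features)
    for i in range(n_features):
        mu_i = X[:, i].mean()            # overall mean of feature i
        num, den = 0.0, 0.0
        for k in classes:
            Xk = X[y == k, i]            # feature i, class k only
            s_k = Xk.shape[0]            # number of samples in class k
            num += s_k * (Xk.mean() - mu_i) ** 2
            den += s_k * Xk.var()        # within-class variance
        scores[i] = num / den if den > 0 else 0.0
    return scores

# Feature 0 separates the classes cleanly; feature 1 is noise,
# so feature 0 should receive the higher score.
X = np.array([[1.0, 5.0], [1.1, 1.0], [0.9, 4.0],
              [5.0, 2.0], [5.1, 5.0], [4.9, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
scores = fisher_score(X, y)
```

Ranking the features by `scores` in descending order and keeping the top-k is then the whole selection step, which is what makes the Fisher score such a simple filter criterion.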