978-1-4799-7633-1/14/$31.00 2014 IEEE

A Correlation-based Multilayer Perceptron Algorithm for Cancer Classification with Gene-Expression Datasets

Sujata Dash, North Orissa University, Baripada, Odisha, India. Sujata_dash@yahoo.com
Ankita Dash, HCL Technologies, Kolkata, India. ankitajinu@gmail.com

Abstract: Research on feature selection techniques for identifying informative genes from high-dimensional microarray datasets has received considerable attention. Numerous researchers have proposed optimized solutions that apply computational tools to reduce noise and redundancy in the data and to enhance the accuracy and generalization of the classification model. High-dimensional microarray gene-expression datasets, however, limit the generalization and effectiveness of many feature selection techniques. A robust feature selection technique is therefore needed, one that removes irrelevant data, increases learning accuracy, and improves the comprehensibility of the experimental results. In this work, a novel correlation-based feature selection algorithm combining symmetrical uncertainty with a multilayer perceptron classifier is proposed. The method identifies the relevance of each feature to the class, as well as its redundancy with respect to all other relevant features in the dataset. It also evaluates the worth of a set of attributes by measuring their symmetrical uncertainty with respect to another set of attributes. The effectiveness of the method is validated against several correlation-based feature selection techniques on multi-category high-dimensional microarray datasets.

Keywords: Correlation-based, Symmetrical uncertainty, microarray dataset, feature relevancy, feature redundancy

I. INTRODUCTION

Cancer is caused by changes or mutations in the expression profiles of certain genes, which makes feature selection techniques essential for finding the genes relevant to classifying the disease.
The most significant genes selected by this process are useful in clinical diagnosis for identifying disease profiles [1]. Discriminative genes are chosen through feature selection techniques that aim to select an optimal subset of genes. However, the high dimension and small sample size characteristic of microarray datasets create many computational challenges for selecting optimal gene subsets, such as the "curse of dimensionality" and over-fitting of the training data. Feature selection is often used as a preprocessing step in machine learning: the non-redundant, relevant features alone are sufficient for effective and efficient learning. Selecting an optimal subset is nevertheless very difficult [2], as the number of possible subsets grows exponentially with the dimension of the feature set. Feature selection techniques can be broadly classified into the filter model [3], [4], [5] and the wrapper model [6], [7]. The filter model selects a feature subset using an evaluation criterion that is independent of the learning algorithm; it relies on evaluation measures applied to general characteristics of the training data, such as information, distance, consistency, and dependency. The wrapper model measures the goodness of the selected subsets by the predictive accuracy of the learning algorithm, but this requires intensive computation for high-dimensional datasets. Another key factor in feature selection is the search strategy. The trade-off between an optimal solution and computational efficiency is managed by adopting an appropriate search strategy, such as random, exhaustive, or heuristic search [8]. Feature selection methods exist for both supervised [5], [9] and unsupervised [10] learning, and they have been applied in several areas, including genomic microarray data analysis, image retrieval, text categorization, and intrusion detection.
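To make the filter model concrete, the following is a minimal sketch (an illustration for this discussion, not the authors' implementation) of filter-style gene ranking: each gene is scored by the absolute Pearson correlation between its expression column and the class label, independently of any learning algorithm, and the top-k genes are retained.

```python
import numpy as np

def filter_rank(X, y, k):
    """Filter-model gene selection: score each gene (column of X) by the
    absolute Pearson correlation with the class label y, then return the
    indices of the k highest-scoring genes. No learning algorithm is used,
    which is what distinguishes the filter model from the wrapper model."""
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    # Sort scores in descending order and keep the top-k gene indices.
    return np.argsort(scores)[::-1][:k]
```

A wrapper-model variant would instead train a classifier (e.g., a multilayer perceptron) on each candidate subset and score the subset by its predictive accuracy, which is far more expensive on high-dimensional data.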
Theoretical and empirical analyses have demonstrated that the presence of irrelevant and redundant features [2], [3] in the dataset reduces the speed and accuracy of learning algorithms, so such features need to be removed. Most feature selection techniques employed so far perform either individual feature evaluation or feature subset evaluation [11]. Individual feature evaluation ranks features by their capability to differentiate instances of different classes; it eliminates irrelevant features, but redundant features are likely to receive similar rankings and therefore survive. Feature subset evaluation finds a minimal subset of features satisfying a measure of goodness, removing both irrelevant and redundant features. However, advanced search strategies such as heuristic and greedy search, even after reducing the search space from O(2^N) to O(N^2), prove inefficient for high-dimensional datasets. This shortcoming motivates exploring feature selection techniques that address both feature relevance and redundancy for high-dimensional microarray datasets.
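The correlation measure named in the abstract, symmetrical uncertainty, is defined in information-theoretic terms as SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information; it is normalized to [0, 1], with 1 indicating that knowing one feature fully determines the other and 0 indicating independence. A minimal sketch for discrete-valued features (an illustration only; the paper's algorithm additionally discretizes expression values and combines SU with a multilayer perceptron) could look like:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a discrete sequence."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1].
    Measures feature-class relevance when y is the class label, and
    feature-feature redundancy when y is another feature."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))   # joint entropy from paired symbols
    mi = hx + hy - hxy               # mutual information I(X; Y)
    denom = hx + hy
    return 2.0 * mi / denom if denom > 0 else 0.0
```

Under this measure, a gene is relevant if its SU with the class label is high, and redundant if its SU with an already-selected gene is high, which is exactly the relevance/redundancy trade-off the section argues a robust method must handle.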