Performance of Feed Forward Neural Network for a Novel Feature Selection Approach

Barnali Sahu #, Debahuti Mishra *

# Department of Computer Science and Engineering, Trident Academy and Technology, Bijupattnaik University Bhubaneswar, Odisha, India
* Department of Computer Science and Engineering, Institute of Technical Education and Research, Siksha O Anusandhan University, Bhubaneswar, Odisha, India

Abstract— Feature selection for cancer classification aims to discover gene expression profiles of diseased and healthy tissues and to use that knowledge to predict the health state of a new sample. It is usually impractical to examine every feature in detail before choosing the right ones. Selection of differentially expressed genes, or biomarker genes, is the pre-processing task for cancer classification. In this paper, we compare the results of two approaches for selecting biomarkers from the Leukaemia data set for feed forward neural networks. The first approach combines k-means clustering with the signal-to-noise ratio (SNR) method for gene ranking: the top-scoring genes from each cluster are selected and given to the classifier. The second approach uses signal-to-noise ratio ranking only for feature selection. To validate both approaches we use holdout validation and compare the results.

Keywords— Differentially Expressed Genes, Feature Selection, K-means, Signal-to-Noise Ratio, Feed Forward Neural Network

I. INTRODUCTION

Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. With the development of genomic techniques, research in molecular biology has shifted from individual genes to entire genomes. Microarray technology can measure the expression levels of thousands of genes in a single experiment.
With a sufficient number of samples, one can investigate whether there are patterns or dissimilarities across samples of different types, such as cancerous versus normal tissue, or even across subtypes of a disease. This problem is referred to as sample classification. On a microarray chip, the number of genes is far greater than the number of samples; however, most genes in a microarray contribute little to the sample classification problem. Therefore, prior to sample classification, it is important to perform gene selection, whereby more interpretable genes are identified as biomarkers, so that more efficient, accurate and reliable classification performance can be expected. Many high-level data analysis techniques, such as clustering and classification algorithms, work well with a small number of genes. This approach usually covers one or more components of microarray data analysis, including dimensionality reduction through gene subset selection, the construction of new predictive features, and model inference [1].

Gene expression microarray technology allows us to measure the expression of thousands of genes simultaneously in a single experiment. The technique captures the gene expression of an organism in different environments, or the expression of a gene in different organisms. Microarray data are generally high dimensional, with a large number of genes compared to the number of samples or conditions. Hence, they suffer from the well-known "curse of dimensionality", which makes microarray data very complex to analyse. There are many efficient methods for the analysis of microarray data, such as clustering, classification and feature selection. Feature selection is the pre-processing task for classification.
Because classification does not work well with a large number of features, feature (gene) selection is essential prior to sample classification, whereby the more relevant and interpretable genes can be filtered. These relevant genes are known as discriminative genes or biomarkers. By training the classifiers with the biomarkers we can achieve better classification accuracy with a low risk of misclassification. The benefit of gene selection is not only better classification accuracy but also reduced cost in a clinical setting. It also enhances the interpretability of the genetic nature of the disease for biologists [2], [3], [4], [5]. As microarray data are high dimensional, noise may be present in the data, and with noisy data the performance and efficiency of the model may decrease. Several feature selection methods are available to address this problem and to increase the efficiency of the model [6]. The well-known feature selection methods are the filter and wrapper methods. Filter methods rank the features according to their discriminative power with regard to the class labels of the samples, whereas wrapper methods select a subset of features from the original feature set with respect to a classifier. Filter scores such as the signal-to-noise ratio [7], t-statistics [8] and the F-test [9] have been shown to be effective for measuring the discriminative power of features in microarray data. In all cases, genes are ranked according to their statistical scores and a certain number of the highest-ranking genes are selected for the purpose of classification.

A. Goal of the Paper

The goal of this paper is to find differentially expressed genes by applying a clustering technique to group similar genes before applying filtering techniques to select a relevant gene subset, with the aim of enhancing the accuracy of the filtering technique. We have adopted two different approaches for relevant gene selection.
In the first approach we use the k-means clustering technique to group the features in the data set, since genes within a cluster are more correlated with each other than with genes in different clusters. We then apply filtering techniques to rank the genes in each cluster, and the best-scoring features in each cluster are selected. The second approach uses signal-to-noise ratio ranking alone for feature selection. The data restricted to the selected features are then tested using feed forward neural network classifiers, and the performance of the two approaches is compared.

Barnali Sahu et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (4), 2011, 1414-1419
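The two gene selection approaches described in this section can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a two-class problem with labels 0 and 1, the commonly used SNR form (difference of class means divided by the sum of class standard deviations), ranking by absolute SNR, and a plain Lloyd-style k-means written with the Python standard library. The function names `snr_scores`, `kmeans_labels` and `select_genes`, and the toy data, are hypothetical and chosen only for illustration; the classifier and holdout validation steps are left out.

```python
import random
from statistics import mean, stdev


def snr_scores(X, y):
    """Per-gene signal-to-noise ratio for a two-class problem.

    X is a list of samples (each a list of gene values); y holds labels 0/1.
    Assumed form: (mean of class 1 - mean of class 0) / (std1 + std0).
    Each class needs at least two samples for the sample standard deviation.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row in pos]
        b = [row[g] for row in neg]
        denom = stdev(a) + stdev(b)
        scores.append((mean(a) - mean(b)) / denom if denom > 0 else 0.0)
    return scores


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, iters=100, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append([mean(col) for col in zip(*members)]
                               if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def select_genes(X, y, k=None, top_per_cluster=1):
    """Return sorted indices of the selected genes.

    k=None: SNR ranking over all genes (second approach).
    k=int : k-means over gene profiles, then the top genes by |SNR|
            within each cluster (first approach).
    """
    scores = snr_scores(X, y)
    n_genes = len(scores)
    if k is None:
        clusters = [0] * n_genes
    else:
        gene_profiles = [list(col) for col in zip(*X)]  # one vector per gene
        clusters = kmeans_labels(gene_profiles, k)
    selected = []
    for c in sorted(set(clusters)):
        members = [g for g in range(n_genes) if clusters[g] == c]
        members.sort(key=lambda g: abs(scores[g]), reverse=True)
        selected.extend(members[:top_per_cluster])
    return sorted(selected)


if __name__ == "__main__":
    # Toy data: 6 samples x 4 genes (made up, not the Leukaemia data set).
    X = [[1.0, 5.0, 2.0, 9.0], [2.0, 5.5, 2.5, 9.5], [1.5, 4.5, 1.5, 8.5],
         [8.0, 5.2, 2.2, 3.0], [9.0, 4.8, 1.8, 2.0], [8.5, 5.1, 2.1, 2.5]]
    y = [0, 0, 0, 1, 1, 1]
    print("SNR only:", select_genes(X, y, top_per_cluster=2))
    print("k-means + SNR:", select_genes(X, y, k=2))
```

Taking the top genes per cluster rather than globally is intended to reduce redundancy among the selected genes, since highly correlated genes tend to land in the same cluster. In either case the returned gene indices would be used to restrict the data before training the classifier.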