Abstract— Feature selection is the process of choosing the most relevant features, or attributes, from a large data set. In microarray data analysis, the number of features is very large while the number of samples is relatively small. In this work, feature selection techniques are used to choose the features supplied to a classifier. A supervised attribute clustering algorithm based on mutual information is implemented to measure the similarity between attributes; this similarity measure helps reduce redundancy among the attributes. Of a large feature set, only a small fraction is effective for a given task, and a small subset of features is also desirable when developing gene expression-based diagnostic tools that deliver precise, reliable, and interpretable results. Hence, our work aims to find such effective features in a high-dimensional diagnostic data set. The work uses a cancer data set together with its class labels. A supervised clustering technique identifies the most correlated feature set based on mutual information. It first computes a relevance value for each attribute with respect to the class label, using the entropy of the attribute and its conditional entropy given the class label. The supervised similarity between attributes is then computed according to the class labels present in the data set. The supervised similarity method considers both unsupervised and supervised similarity measures in order to obtain the most correlated features. Finally, the SMO and C4.5 classification algorithms are applied to the selected feature set, and the classification accuracy of SMO is found to be greater than that of C4.5.

Keywords— feature selection, attribute clustering, mutual information, microarray classification.

I. 
INTRODUCTION

Dr S. Senthamarai Kannan 1 is with the Sethu Institute of Technology, Pulloor, Kariapatti, India (phone: +919865578848; fax: +914566(308000); e-mail: stanfordssk@gmail.com).
Sherin Mariam John is a P.G. student of M.E. C.S.E. with the Sethu Institute of Technology (e-mail: sherinmjohn@gmail.com).
P. Sundaravadivel 3 is a Ph.D. scholar of Anna University, Chennai, India (e-mail: sundar.me2009@gmail.com).
S. Ilangovan 4 is a Ph.D. scholar of Anna University, Chennai, India (e-mail: ilangovans@yahoo.com).
Dr A. Vincent Antony Kumar 5 is serving as Professor of M.C.A. with PSNA College of Engineering & Technology, Dindigul, India (e-mail: hodit@psnacet.edu.in).

A microarray gene expression data set can be represented by an expression table in which each row corresponds to one particular gene, each column to a sample, and each entry is the measured expression level of that gene in that sample. However, for most gene expression data, the number of training samples is still very small compared to the large number of genes involved in the experiments. When the number of genes is significantly greater than the number of samples, it is possible to find biologically relevant correlations of gene behavior with the sample categories or response variables. Identifying a reduced set of the most relevant genes is the goal of gene selection. The small number of training samples and the large number of genes make gene selection a more relevant and challenging problem in gene expression-based classification. Because this is a feature selection problem, clustering can be used: it partitions the given gene set into subgroups, each of which should be as homogeneous as possible. The genes, or attributes, within a cluster are more correlated with each other, whereas genes in different clusters are less correlated.
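As a concrete illustration of gene relevance, the abstract describes scoring each attribute against the class label from its entropy and its conditional entropy, that is, by the mutual information $I(A;C) = H(A) - H(A \mid C)$. The following Python sketch is a minimal illustration under assumptions that are not taken from the paper: expression values are assumed to be already discretized into a few levels, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(A), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(values, labels):
    """H(A | C): per-class entropy of the attribute, weighted by class frequency."""
    n = len(values)
    by_class = {}
    for v, c in zip(values, labels):
        by_class.setdefault(c, []).append(v)
    return sum(len(vs) / n * entropy(vs) for vs in by_class.values())


def relevance(values, labels):
    """Mutual information I(A; C) = H(A) - H(A | C) between attribute and class."""
    return entropy(values) - conditional_entropy(values, labels)


# Hypothetical toy data: six samples, expression discretized to "low"/"high".
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
genes = {
    "gene_a": ["high", "high", "high", "low", "low", "low"],  # follows the class label
    "gene_b": ["high", "low", "high", "low", "high", "low"],  # nearly unrelated to it
}

# Rank genes from most to least relevant to the class label.
ranked = sorted(genes, key=lambda g: relevance(genes[g], labels), reverse=True)
```

In this toy example, gene_a separates the two classes perfectly and receives a relevance of 1 bit, whereas gene_b is almost independent of the class and scores close to zero, so gene_a is ranked first. Real microarray data are continuous, so a discretization step (an assumption here) would be needed before these counts can be formed.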
Attribute clustering reduces the search dimension of a classification algorithm and constructs the model from a tightly correlated subset of genes rather than from the entire gene space. After the genes are clustered, a reduced set of genes can be selected for further analysis. Supervised attribute clustering is defined as the grouping of genes, or attributes, controlled by the information of sample categories or response variables. In general, the quality of the generated clusters is always relative to a certain criterion, and different criteria may lead to different clustering results. However, every criterion tries to measure the similarity among the subset of genes present in a cluster. In supervised attribute clustering, the similarity between attributes is measured and redundancy among the attributes is eliminated. The information of response variables is incorporated in attribute clustering to find groups of co-regulated genes with strong association to the sample categories. The clusters are then refined incrementally based on the sample categories. The biological significance of the generated clusters is interpreted using the gene ontology. Classification algorithms are implemented with the selected finer cluster and comparisons between classifiers are made to observe the

Performance Evaluation of Classification Algorithms C4.5 and SMO on Microarray Gene-set Data
Dr S. Senthamarai Kannan 1, Sherin Mariam John 2, P. Sundaravadivel 3, S. Ilangovan 4, and Dr A. Vincent Antony Kumar 5
International Journal of Research in Engineering and Technology (IJRET), Vol. 2, No. 5, 2013, ISSN 2277-4378, p. 275