International Journal of Research in Advent Technology, Vol.7, No.5, May 2019
E-ISSN: 2321-9637
Available online at www.ijrat.org

Ensemble based Fuzzy-Rough Nearest Neighbor Approach for Classification of Cancer from Microarray Data

Ansuman Kumar 1, Anindya Halder 2
1 Dept. of Computer Application, North-Eastern Hill University, Tura Campus, Meghalaya 794002, India.
2 Corresponding Author’s Email: anindya.halder@gmail.com

Abstract- Cancer sample classification from gene expression data is one of the challenging areas of research in biomedical engineering and machine learning. In gene expression data, labeled samples are very limited in comparison to unlabeled samples, and labeling unlabeled data is costly. Therefore, a single classifier trained with limited training samples often fails to produce the required accuracy. In such situations, an ensemble technique can be effective, as it combines the results of individual classifiers, which can improve the cancer classification accuracy. In this article, a novel ensemble based fuzzy-rough nearest neighbour (EnFRNN) method for cancer sample classification from microarray gene expression data is proposed. The proposed method is able to deal with the uncertainty, overlapping and indiscernibility generally present in cancer subtype classes of the gene expression data. The proposed ensemble classifier is tested on eight publicly available microarray gene expression datasets. Experimental results suggest that the proposed ensemble classifier provides better results than the individual classifier for cancer classification from gene expression data. In summary, the fuzzy-rough based ensemble learning method turns out to be very effective in cancer sample classification from gene expression data, particularly when the result of the individual classifier is not up to the mark with limited training samples.

Index Terms- Cancer classification; Ensemble technique; Microarray gene expression data; Fuzzy set; Rough set.

1. INTRODUCTION

Traditional clinical methods for cancer sample classification rely on the clinical findings and the morphological appearance of the tumor. These techniques are costly and time consuming. The recent development of microarray technology [1] has enabled biologists to monitor the expression of thousands of genes in a single experiment, making comparatively low-cost diagnosis and prediction of cancer at an early stage possible. Different machine learning techniques have been applied to microarray gene expression data analysis in supervised (i.e., classification) [2], unsupervised (i.e., clustering) [3], semi-supervised clustering [4], and semi-supervised classification [5] modes.

Generally, the number of samples present in microarray gene expression data is very small compared to the number of genes [6], and the classes present in the data are often vague and overlapping in nature. Therefore, traditional classifiers often fail to achieve the required accuracy. In this circumstance, the ensemble technique [7] is expected to be useful, as it judiciously combines the predictions of the individual classifiers to produce a final decision that is expected to be better than that of any individual classifier. An ensemble technique is a learning model that uses many base classifiers and combines their opinions in such a way that the combined result improves on the performance of any individual classifier [7]. Heterogeneity among the base classifiers and diversity in the training data set are the key to the success of an ensemble technique. A variety of popular ensemble algorithms have been proposed in the literature, viz., Bagging, Boosting, AdaBoost, and Random Forest [8]. Ensemble methods have the ability to deal with small sample size and high dimensionality. Therefore, ensemble methods have been widely applied to microarray gene expression data.
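As an illustration of how an ensemble combines the opinions of base classifiers, the following is a minimal sketch of bagging with majority voting. It is not the proposed EnFRNN method: the base learner here is assumed to be a plain k-nearest-neighbour classifier, and the function names, parameter defaults, toy samples and the labels "tumor" and "normal" are all hypothetical, chosen only for the example.

```python
import random
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=3):
    """Plain k-NN (assumed base learner): majority label among the k nearest samples."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def bagged_knn_predict(train_X, train_y, x, n_estimators=11, k=3, seed=0):
    """Bagging: each base k-NN is given a bootstrap resample of the training set,
    and the final label is the majority vote over all base predictions."""
    rng = random.Random(seed)
    n = len(train_X)
    votes = []
    for _ in range(n_estimators):
        # Sample n indices with replacement to create diversity in the training data.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_X = [train_X[i] for i in idx]
        boot_y = [train_y[i] for i in idx]
        votes.append(knn_predict(boot_X, boot_y, x, k))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of two-gene expression profiles.
toy_X = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3], [4.9, 4.8],
]
toy_y = ["tumor"] * 5 + ["normal"] * 5

prediction = bagged_knn_predict(toy_X, toy_y, [0.1, 0.1])
```

The bootstrap resampling supplies the diversity in the training data mentioned above, while the vote is the simplest way of combining the base predictions; other ensemble schemes such as Boosting weight the base classifiers instead of counting them equally.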
A notable review of ensemble methods applied in bioinformatics may be found in [9].

Several pioneering works on classifying cancer from microarray gene expression data have been proposed. Dettling and Buhlmann [10] proposed boosting for tumor classification with gene expression data. Osareh and Shadgar [11] provided an efficient ensemble learning method using the RotBoost ensemble methodology. Valentini et al. [12] introduced bagged ensembles of support vector machines for cancer recognition. However, those ensemble methods are not able to handle the uncertainty, ambiguity, overlap and vagueness often present in gene expression data. Therefore, in this work an ensemble technique using fuzzy-rough nearest neighbour (EnFRNN) is proposed for cancer sample classification from gene expression data. It aims to improve on the prediction accuracy of any individual classifier, and it can handle the possible presence of uncertainty, ambiguity, vagueness, indiscernibility and overlap in the cancer subtype classes.

The remainder of the article is structured as follows. The background theory related to this article is