(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 2, 2021
www.ijacsa.thesai.org

Hybrid Feature Selection and Ensemble Learning Methods for Gene Selection and Cancer Classification

Sultan Noman Qasem¹, Faisal Saeed²

¹ Computer Science Department, College of Computer and Information Sciences, Al Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
¹ Computer Science Department, Faculty of Applied Science, Taiz University, Taiz, Yemen
² Information Systems Department, College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia

Abstract—The classification of cancer based on gene expression data is a promising research field in bioinformatics and data mining. Not all genes contribute to efficient sample classification. Thus, a robust feature selection method is needed to identify the genes that help distinguish samples efficiently. Redundancy in gene expression data contributes to low classification performance. This paper presents a combination of gene selection and classification methods using ranking and wrapper methods. In the ranking methods, information gain was used to reduce the dimensionality to 1% and 5%. Then, in the wrapper methods, K-nearest neighbors and Naïve Bayes were used with Best First, Greedy Stepwise, and Rank Search. Several combinations were investigated because no single model is known to give the best results on all datasets under all circumstances. Therefore, combining multiple feature selection methods and applying different classification models could provide a better decision on the final predicted cancer types. Compared with existing classifiers, the proposed ensemble gene selection methods obtained comparable performance.

Keywords—Microarray; gene selection; ensemble classification; cancer classification; gene expression

I. INTRODUCTION

Gene expression is the process by which a deoxyribonucleic acid (DNA) sequence is transcribed into ribonucleic acid (RNA). The expression level of a gene reflects the approximate number of RNA copies of that gene produced in the cell and is correlated with the amount of the corresponding protein [1]. A microarray is a technique for simultaneously measuring the expression levels of tens of thousands of genes on a single chip. Microarrays therefore provide an effective way to collect data that can be used to establish the expression patterns of thousands of genes. In most classification problems, the high dimensionality of gene expression data is a major challenge. Moreover, not all genes contribute to cancer; a large number of genes have little or no clinical relevance. Consequently, using all genes in microarray gene expression classification can lead to incorrect diagnosis. The two main reasons for low classification accuracy are the large number of features (genes) relative to the limited sample size and the redundancy in the expression data [2]. Dimensionality reduction is therefore necessary. Standard machine learning methods have not been effective, since they are better suited to settings with more samples than features. To address these problems, dimensionality reduction or feature (gene) selection algorithms have been used. Gene selection methods are usually divided into three groups, namely filter, wrapper, and embedded methods. The filter approach generally evaluates each feature individually using its statistical characteristics. The wrapper approach uses a learning algorithm to choose the best subset of features, and its effectiveness is measured by the accuracy of the particular classifier. Evolutionary or bio-inspired algorithms are also used in the wrapper approach to direct the search process.
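The filter and wrapper steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this paper: the function names are hypothetical, information gain is computed after a simple median split of each gene, the wrapper is a greedy forward search scored by leave-one-out accuracy of a 1-nearest-neighbour classifier, and the data are a made-up toy matrix. The toy example keeps 50% of the genes only because it has four genes in total.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Information gain of one gene after a median split (assumed discretization)."""
    threshold = sorted(column)[len(column) // 2]
    low = [y for x, y in zip(column, labels) if x < threshold]
    high = [y for x, y in zip(column, labels) if x >= threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder

def filter_top_genes(X, y, fraction):
    """Filter step: rank genes by information gain and keep the top fraction."""
    n_genes = len(X[0])
    keep = max(1, round(n_genes * fraction))
    scores = [information_gain([row[j] for row in X], y) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]

def loo_knn_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given genes."""
    correct = 0
    for i, row in enumerate(X):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: sum((row[g] - X[k][g]) ** 2 for g in genes))
        correct += y[nearest] == y[i]
    return correct / len(X)

def greedy_wrapper(X, y, candidates):
    """Wrapper step: add one gene at a time while classifier accuracy improves."""
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for g in candidates:
            if g in selected:
                continue
            acc = loo_knn_accuracy(X, y, selected + [g])
            if acc > best_acc:
                best_acc, best_gene, improved = acc, g, True
        if improved:
            selected.append(best_gene)
    return selected, best_acc

# Toy expression matrix (rows = samples, columns = genes); values are invented.
X = [
    [1.0, 5.0, 0.3, 2.0],
    [1.2, 3.0, 0.9, 2.1],
    [0.9, 4.0, 0.1, 1.9],
    [3.0, 5.1, 0.8, 2.0],
    [3.2, 2.9, 0.2, 2.2],
    [2.9, 4.1, 0.5, 1.8],
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

candidates = filter_top_genes(X, y, fraction=0.5)    # filter: keep top-ranked genes
selected, accuracy = greedy_wrapper(X, y, candidates)  # wrapper: search within them
print("filtered genes:", candidates)
print("selected genes:", selected, "accuracy:", accuracy)
```

The filter step is cheap because it scores each gene once and independently of any classifier, whereas the wrapper step retrains and evaluates the classifier for every candidate subset; running the filter first keeps the wrapper search small.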
The embedded approach searches for the best feature subset as part of building the classifier itself. The general framework for feature selection has recently been complemented with hybrid and ensemble approaches. Hybrid approaches are designed to take advantage of both the filter and the wrapper approaches. Extensive work has investigated this issue and proposed several methods [3-16].

Several feature selection methods have been applied. For instance, the authors in [17-19] proposed hybrid methods that combine filter and wrapper algorithms to overcome the disadvantages of each individual one. Conventional optimization algorithms do not work efficiently for feature selection in large-scale problems [20]. Alternatively, different meta-heuristic algorithms have been adapted for feature selection. Examples of these algorithms are the Genetic Algorithm (GA) [21], Ant Colony Optimization [22], Simulated Annealing [23], and Particle Swarm Optimization (PSO) [24, 25]. In addition, a modified support vector machine (SVM) was suggested to select the minimum possible number of genes [26]. A multi-objective version of the bat algorithm for binary feature selection [27] and the Genetic Bee Colony (GBC) algorithm [28] were successfully applied to high-dimensional datasets. Moreover, a hybrid feature selection algorithm was proposed that combines mutual information maximization (MIM) and the adaptive genetic algorithm (AGA) [19]. The reduced gene expression dataset yielded higher classification accuracy than conventional feature selection algorithms. In addition, a binary version of the Black Hole Algorithm, called BBHA, was proposed for solving the feature selection problem in biological data. However, only classifiers from the tree family were tested; other kinds of classifiers were not assessed [29]. Along this line, the assessment of different classifiers such as artificial neural network (ANN) [30] and