METHODOLOGIES AND APPLICATION

Novel machine learning approach for classification of high-dimensional microarray data

Rabia Aziz Musheer 1,2 • C. K. Verma 2 • Namita Srivastava 2

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Independent component analysis (ICA) is a powerful concept for reducing the dimension of big data in many applications. It has been used for the feature extraction of microarray gene expression data in numerous works. One of the merits of ICA is that the number of extracted features is always equal to the number of samples. When ICA is applied to microarray data, however, it faces the challenge of finding the best subset of genes (features) among the extracted features. To resolve this problem, in this paper, we propose a new artificial bee colony (ABC)-based feature selection approach for microarray data. Our approach has two stages: an ICA-based extraction stage to reduce the size of the data and an ABC-based wrapper stage to optimize the reduced feature vectors. To validate the proposed approach, extensive experiments were conducted to compare the performance of ICA + ABC with the results obtained from recently published and other previously suggested methods of gene selection for the Naïve Bayes (NB) classifier. To compare the performance of the proposed approach with other algorithms, a statistical hypothesis test was employed with six benchmark microarray cancer classification datasets. The experimental results show that the proposed approach improves on all the compared algorithms for the NB classifier at a certain level of significance.
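The wrapper idea summarized in the abstract, scoring candidate feature subsets by the accuracy of an NB classifier, can be illustrated with a minimal sketch. This is not the authors' method: the ICA stage is omitted, and a simple bit-flip hill climb stands in for the ABC search. All function names, the toy data generator and the leave-one-out fitness are assumptions introduced here for illustration only.

```python
import math
import random


def gaussian_nb_fit(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps variances strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def gaussian_nb_predict(model, x):
    """Return the class with the highest Gaussian NB log-posterior."""
    def log_post(prior, means, variances):
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=lambda label: log_post(*model[label]))


def loocv_accuracy(X, y, mask):
    """Leave-one-out NB accuracy using only the features kept by mask."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    Xs = [[row[j] for j in cols] for row in X]
    hits = 0
    for i in range(len(Xs)):
        model = gaussian_nb_fit(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:])
        hits += gaussian_nb_predict(model, Xs[i]) == y[i]
    return hits / len(Xs)


def wrapper_search(X, y, iterations=200, seed=0):
    """Bit-flip hill climb over feature masks (a stand-in for ABC)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    best = [rng.random() < 0.5 for _ in range(n_features)]
    best_score = loocv_accuracy(X, y, best)
    for _ in range(iterations):
        cand = best[:]
        j = rng.randrange(n_features)
        cand[j] = not cand[j]
        score = loocv_accuracy(X, y, cand)
        # Accept a better score, or an equal score with fewer features.
        if score > best_score or (score == best_score
                                  and sum(cand) < sum(best)):
            best, best_score = cand, score
    return best, best_score


def make_toy_data(seed=1, n_per_class=10, n_features=6):
    """Synthetic two-class data; only feature 0 carries class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            row[0] += 4.0 * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, score = wrapper_search(X, y)
    print("selected features:", [j for j, keep in enumerate(mask) if keep])
    print("leave-one-out accuracy:", score)
```

In a fuller pipeline, the rows of X would be the ICA-extracted components rather than synthetic values, and the hill climb would be replaced by the employed, onlooker and scout bee phases of ABC; the fitness function could stay the same.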
Keywords Independent component analysis (ICA) · Artificial bee colony (ABC) · Naïve Bayes (NB) · Cancer classification

1 Introduction

The field of machine learning provides computer-based approaches suited to the analysis of different types of datasets; these approaches are developed and improve with experience (Ahmadi and Mahmoudi 2016; Ahmadi 2015b; Ahmadi and Bahadori 2016; Ahmadi et al. 2015d; Ahmadi and Shadizadeh 2012). Machine learning techniques address clustering, classification, prediction and various other problems using supervised, unsupervised and semi-supervised methods (Ahmadi et al. 2014b, c, d, e, f, g; Ahmadi and Ebadi 2014). One of the major applications of microarray data analysis is sample classification for the diagnosis and prognosis of disease. Examples of machine learning techniques that have been used in cancer classification of microarray data include the decision tree, neural networks, the support vector machine and the Naïve Bayesian classifier. However, the small number of samples relative to the high dimensionality is the main difficulty for most machine learning techniques. This problem is known as the 'curse of dimensionality.'

Dimension reduction plays an important role in DNA microarray data classification (Lazar et al. 2012; Saeys et al. 2007). There are two main families of dimension reduction algorithms: feature extraction and feature selection. Feature extraction algorithms transform the features into a lower-dimensional space using combinations of the original features. Feature selection methods select the most relevant features from the full feature set to construct the classification model. Feature selection algorithms can be arranged into three types, namely filter, wrapper and embedded methods. Filter methods are the ones that select features as a

Communicated by V. Loia.
Rabia Aziz Musheer
rabia.musheer@vitbhopal.ac.in

1 Department of SASL (Mathematics), VIT University Bhopal, Bhopal, M.P. 466116, India
2 Department of Mathematics and Computer Application, Maulana Azad National Institute of Technology, Bhopal, M.P. 462003, India

Soft Computing
https://doi.org/10.1007/s00500-019-03879-7