Distributed feature selection: A hesitant fuzzy correlation concept for microarray high-dimensional datasets

Mohammad Kazem Ebrahimpour, Mahdi Eftekhari *
Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran

ARTICLE INFO

Keywords: Distributed machine learning; Distributed feature selection; Hesitant fuzzy sets; Microarray high-dimensional datasets; Divide-and-conquer feature selection

ABSTRACT

Feature selection has been a problem of interest for many years. Almost all existing feature selection approaches use all training samples and features at once to select salient features; such approaches are called centralized methods. Other approaches, however, split the training data along their dimensions so that each batch can be run on a different cluster (machine) when dealing with ultra-big data. In this paper, a novel distributed feature selection approach based on hesitant fuzzy sets is proposed. First, datasets are divided horizontally (by their features) into subsets according to the information energies of hesitant fuzzy sets, with shuffling. Then, our HCPF (Hesitant fuzzy set based feature selection algorithm using Correlation coefficients for Partitioning Features) is applied to each subset individually. Finally, a merging procedure updates the final feature subset according to improvements in classification accuracy. The effectiveness of the proposed method has been evaluated against twenty-two state-of-the-art distributed and centralized algorithms on eight well-known microarray high-dimensional datasets. The experimental results, assessed with the non-parametric Wilcoxon signed-rank test, reveal that the proposed method achieves significant improvements over the other approaches. Our experiments confirm that the proposed method effectively tackles the feature selection problem in ultra-high-dimensional datasets, in terms of both classification accuracy and dimension reduction.

1. Introduction

In the last two decades, handling DNA microarray high-dimensional datasets has created a new line of research in both machine learning [1-5] and bioinformatics [6-9]. These datasets suffer from small sample sizes and huge numbers of features, since they measure gene expression [10]. Therefore, feature selection [11], which removes irrelevant and redundant features from the dataset, plays a crucial role for DNA microarray data: the learning algorithms can then concentrate on the important features that are useful for future predictions.

Typically, feature selection approaches are divided into three main groups [12-14]: filters, wrappers, and embedded methods [13]. Filter approaches perform feature selection by considering intrinsic characteristics of the features [8]. They are therefore fast and can be used when dealing with huge datasets; however, since they do not consider the classifier/regressor in their decision-making process, their performance is not as good as that of model-based approaches [15]. Wrapper approaches, on the other hand, train a model to evaluate candidate subsets. They are thus more accurate than filters; on the other hand, since they train a model for each candidate subset, they are computationally expensive. Embedded approaches try to obtain a good subset of features during the training phase. The logic behind these approaches is that the more important features receive higher weights in the trained model [16]. This makes sense, since higher weights mean more impact on the outputs. Embedded approaches can be considered a trade-off between filter and wrapper approaches, since they are considerably accurate and fast.

Traditional feature selection algorithms assume that the whole dataset can be loaded on one computer; therefore, they apply machine learning algorithms to the whole dataset at once. We call this process centralized machine learning [10,12,13,15]. For instance, Hoque et al.
[15] considered feature selection as a multi-objective optimization problem. They noted that feature selection has two general aims: selecting features relevant to the class labels and avoiding redundancy among the selected features. Thus, they used the multi-objective NSGA-II algorithm to deal with it. Moreover, Canul-Reich et al. [17] introduced an iterative embedded feature selection algorithm

* Corresponding author. E-mail address: m.eftekhari@uk.ac.ir (M. Eftekhari).
Chemometrics and Intelligent Laboratory Systems 173 (2018) 51-64
https://doi.org/10.1016/j.chemolab.2018.01.001
Received 3 September 2017; Received in revised form 29 November 2017; Accepted 4 January 2018; Available online 6 January 2018
0169-7439/© 2018 Elsevier B.V. All rights reserved.
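The divide, select, and merge scheme summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' HCPF algorithm: the hesitant fuzzy correlation measure is replaced here by a plain Pearson correlation filter, the information-energy-based partitioning is replaced by a simple shuffle-and-split, and the function names (distributed_fs, pearson, knn_accuracy) and the merge rule (keep a partition's candidates only if leave-one-out accuracy does not drop) are illustrative assumptions, not the paper's method.

```python
import random

def pearson(x, y):
    # Pearson correlation between a feature column x and the labels y;
    # stands in for the paper's hesitant fuzzy correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / ((vx * vy) ** 0.5) if vx and vy else 0.0

def knn_accuracy(X, y, feats):
    # Leave-one-out 1-NN accuracy using only the selected feature columns;
    # a cheap stand-in for the classifier used in the merging step.
    correct = 0
    for i in range(len(X)):
        best_label, best_d = None, float("inf")
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feats)
            if d < best_d:
                best_d, best_label = d, y[j]
        correct += (best_label == y[i])
    return correct / len(X)

def distributed_fs(X, y, n_parts=4, top_k=2, seed=0):
    # 1) Shuffle feature indices and split them horizontally (by features).
    n_features = len(X[0])
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    parts = [idx[p::n_parts] for p in range(n_parts)]
    selected = []
    for part in parts:
        # 2) Filter step on each partition: rank by |correlation with label|.
        ranked = sorted(part, key=lambda f: -abs(pearson([row[f] for row in X], y)))
        candidate = ranked[:top_k]
        # 3) Merge step: keep the candidates only if accuracy does not drop.
        if not selected or knn_accuracy(X, y, selected + candidate) >= knn_accuracy(X, y, selected):
            selected += candidate
    return selected
```

As a usage example, on a toy dataset whose first column equals the class label and whose remaining columns are noise, the correlation filter ranks the informative column first within its partition, and the merge step accumulates at most n_parts * top_k features.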