Distributed feature selection: A hesitant fuzzy correlation concept for
microarray high-dimensional datasets
Mohammad Kazem Ebrahimpour, Mahdi Eftekhari *
Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
ARTICLE INFO
Keywords:
Distributed machine learning
Distributed feature selection
Hesitant fuzzy sets
Microarray high dimensional datasets
Divide and conquer feature selection
ABSTRACT
Feature selection has been a problem of interest for many years. Almost all existing feature selection approaches
use all training samples and features at once to select salient features. These approaches are called
centralized methods; in contrast, other approaches split the training data along their dimensions so
that each batch can be run on a different cluster (machine) when dealing with ultra-big data. In
this paper, a novel distributed feature selection approach based on hesitant fuzzy sets is proposed. First, datasets
are divided horizontally (by their features) into several subsets according to the information energies of hesitant
fuzzy sets and shuffling. Then, our HCPF (Hesitant fuzzy set based feature selection algorithm
using Correlation coefficients for Partitioning Features) is applied to each subset individually. Finally, a merging procedure
updates the final feature subset according to improvements in classification accuracy. The
effectiveness of the proposed method has been evaluated against twenty-two state-of-the-art distributed and
centralized algorithms on eight well-known microarray high-dimensional datasets. The experimental results
reveal that the proposed method achieves significant improvements over the other approaches according to the
non-parametric Wilcoxon signed-rank test. Our experiments confirm that the proposed method is
effective in tackling the feature selection problem, in terms of both classification accuracy and dimension reduction, on ultra-high-dimensional datasets.
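The divide-apply-merge pipeline described above can be sketched as follows. This is a minimal illustration of the skeleton only, not the authors' HCPF algorithm: the `local_selector` and `evaluate` callables used in the demo are simplified stand-ins (a top-1 correlation pick and a mean absolute correlation score) rather than the hesitant fuzzy criteria of the paper.

```python
import numpy as np

def distributed_select(X, y, n_parts, local_selector, evaluate):
    """Divide-and-conquer feature selection skeleton: shuffle the
    columns (features), split them into n_parts groups, run a local
    selector on each group independently, then merge greedily,
    keeping a group's picks only if they improve the score."""
    rng = np.random.default_rng(0)
    cols = rng.permutation(X.shape[1])      # shuffle features
    parts = np.array_split(cols, n_parts)   # horizontal split by features
    selected, best = [], -np.inf
    for part in parts:
        # Local selection returns indices *within* the subset.
        picks = [part[i] for i in local_selector(X[:, part], y)]
        trial = selected + picks
        score = evaluate(X[:, trial], y)
        if score > best:                    # keep only improving merges
            selected, best = trial, score
    return [int(c) for c in selected]

# Demo on synthetic data: 6 features, 2 of them informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(float)

def top1_by_corr(Xs, ys):
    # Pick the single feature most correlated with the label.
    c = np.abs(np.corrcoef(Xs.T, ys)[-1, :-1])
    return [int(np.nanargmax(c))]

def mean_abs_corr(Xs, ys):
    # Score a subset by its mean absolute correlation with the label.
    return float(np.mean(np.abs(np.corrcoef(Xs.T, ys)[-1, :-1])))

chosen = distributed_select(X, y, n_parts=3,
                            local_selector=top1_by_corr,
                            evaluate=mean_abs_corr)
print(chosen)
```

Each subset can be processed on a separate machine, since the local selection step touches only its own columns; only the merge step needs a shared evaluation.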
1. Introduction
In the last two decades, handling DNA microarray high-dimensional
datasets has created a new line of research in both machine
learning [1–5] and bioinformatics [6–9]. These types of datasets suffer
from small sample sizes and huge numbers of features, since they measure
gene expression [10]. Therefore, feature selection [11], which removes
irrelevant and redundant features from the dataset, plays a crucial role
for DNA microarray datasets. Thus, the learning algorithms can
concentrate on the features that are useful for future predictions.
Typically, feature selection approaches are divided into three main
groups [12–14]: filters, wrappers, and embedded methods [13]. Filter
approaches perform feature selection by considering the intrinsic
characteristics of the features themselves [8]. Therefore, these approaches are fast and can
be used when dealing with huge datasets; however, since they do
not consider the classifier/regressor in their decision-making process,
their performance is not as good as that of model-based approaches [15].
On the other hand, wrapper approaches train a model to evaluate
candidate subsets. Thus, they are more accurate than filters; on the contrary,
since they train a model for each candidate subset, they are computa-
tionally expensive. Embedded approaches try to obtain a good subset
of features during the training phase. The logic behind these approaches
is that more important features receive higher weights in the trained model
[16]. This idea makes sense, since higher weights mean more
impact on the outputs. Embedded approaches can be considered a
trade-off between filter and wrapper approaches, since they are consid-
erably accurate and fast.
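The filter criterion described above can be illustrated with a small sketch. This is a generic example of a filter method (ranking by absolute Pearson correlation with the class label), not any specific algorithm from the cited works; the function name `filter_rank` is ours.

```python
import numpy as np

def filter_rank(X, y, k):
    """Rank features by absolute Pearson correlation with the label
    and keep the top-k -- a classic filter criterion that scores
    features by their intrinsic relation to the target, without
    training any classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Correlate each column of X with y in one vectorized pass.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return np.argsort(scores)[::-1][:k]

# Tiny example: feature 0 tracks the label, feature 1 is noise.
X = np.array([[1.0, 0.3], [2.0, 0.1], [3.0, 0.4], [4.0, 0.2]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(filter_rank(X, y, k=1))  # feature 0 ranks first
```

Because no model is trained, the cost is one pass over the data, which is why filters scale to the huge feature counts of microarray datasets; a wrapper would instead fit a classifier per candidate subset.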
Traditional feature selection algorithms assume that the whole
dataset can be loaded on one computer; therefore, they apply machine learning
algorithms to the whole dataset at once. We call this process central-
ized machine learning [10,12,13,15]. For instance, Hoque et al. [15]
considered the feature selection problem as a multi-objective
optimization problem. They mentioned that feature selection has two
general aims: selecting features relevant to the class labels and avoiding
redundancy among the selected features. Thus, they used the multi-objective
NSGA-II algorithm to deal with it. Moreover, Canul-Reich et
al. [17] introduced an iterative embedded feature selection algorithm
* Corresponding author.
E-mail address: m.eftekhari@uk.ac.ir (M. Eftekhari).
Chemometrics and Intelligent Laboratory Systems
https://doi.org/10.1016/j.chemolab.2018.01.001
Received 3 September 2017; Received in revised form 29 November 2017; Accepted 4 January 2018
Available online 6 January 2018
0169-7439/© 2018 Elsevier B.V. All rights reserved.
Chemometrics and Intelligent Laboratory Systems 173 (2018) 51–64