Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx
DOI:10.3233/JIFS-171297
IOS Press
An efficient search algorithm for biomarker
selection from RNA-seq prostate cancer data
Saleh Shahbeig, Akbar Rahideh∗, Mohammad Sadegh Helfroush and Kamran Kazemi
Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
Abstract. RNA-sequencing technology makes it possible to examine the expression of thousands of genes simultaneously.
Large-scale gene expression data include a huge number of genes but only a few samples. Therefore, algorithms that can
accurately detect the genes associated with a specific disease among a huge number of unrelated genes can help experts
detect and treat the disease early.
A two-phase search algorithm is proposed in this paper to discover biomarkers in an RNA-seq gene expression dataset
for prostate cancer diagnosis. After statistical noise is removed from the original large-scale dataset, a multi-objective
optimization process is proposed to select the best non-dominated subset of genes with the maximum classification accuracy
and the minimum number of genes, simultaneously. Finally, the proposed cache-based modification of the sequential forward
floating selection (CMSFFS) algorithm is applied to the selected subset of genes to discover the most discriminant genes.
The obtained results show that the proposed algorithm achieves a classification accuracy, sensitivity and
specificity of 100% on the large-scale RNA-seq prostate cancer dataset by selecting only three biomarkers.
Keywords: RNA-seq, large-scale prostate cancer data, two-phase search algorithm, multi-objective-based optimization,
CMSFFS
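
The notion of a non-dominated subset used in the abstract can be illustrated with a minimal sketch. It assumes each candidate gene subset is summarized by two numbers, its classification accuracy (to be maximized) and its number of genes (to be minimized). The function names and the candidate values below are hypothetical and are not taken from the paper; the sketch shows only the dominance test, not the authors' optimization procedure.

```python
from typing import List, Tuple

# A candidate subset summarized as (accuracy, number_of_genes).
# All values here are illustrative assumptions, not results from the paper.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    (higher accuracy, fewer genes) and strictly better on at least one."""
    acc_a, n_a = a
    acc_b, n_b = b
    no_worse = acc_a >= acc_b and n_a <= n_b
    strictly_better = acc_a > acc_b or n_a < n_b
    return no_worse and strictly_better


def non_dominated(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates (the Pareto front)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [(0.95, 40), (0.97, 25), (0.97, 30), (0.99, 60), (0.90, 10)]
front = non_dominated(candidates)
print(front)
```

In this toy example, (0.95, 40) and (0.97, 30) are both dominated by (0.97, 25), so only the remaining three candidates stay on the front; choosing one of them is a trade-off between accuracy and subset size.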
1. Introduction
The prostate is a small gland in the male reproductive system. Prostate cancer, which is also known as carcinoma of the prostate, occurs when some of the prostate cells reproduce much faster than normal. RNA-sequencing data are known as large-scale data. Scientists use large-scale data to examine the differences between normal and abnormal cells in the human body. Because of the high cost of genetic tests, the number of samples in large-scale data is very low compared to the number of extracted genes. This is the main reason for the poor performance of many reported gene selection techniques. It is of great importance to develop an algorithm for the accurate detection of the genes associated with prostate cancer among a huge number of unrelated genes. A number of studies dealing with gene selection from large-scale prostate cancer data have been reported, as follows. Chiang and Ho [1] have combined the rough-based feature selection method with the radial basis function (RBF) neural network for the classification of gene expression data. This method can find the relevant features without requiring the number of clusters to be known a priori and can identify centers that approximate the correct ones. In that paper, the authors attempted to introduce a prediction scheme that combines the rough-based feature selection method with the radial basis function neural network. A hybrid gene selection method (IG-SVM) has been proposed in [2] to select informative genes for cancer classification. IG is a filter method that can eliminate irrelevant features in high-dimensional

∗Corresponding author. Akbar Rahideh, Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran. E-mail: rahide@sutech.ac.ir.
1064-1246/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved