Uncorrected Author Proof
Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx
DOI:10.3233/JIFS-171297
IOS Press

An efficient search algorithm for biomarker selection from RNA-seq prostate cancer data

Saleh Shahbeig, Akbar Rahideh, Mohammad Sadegh Helfroush and Kamran Kazemi
Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran

Abstract. RNA-sequencing technology makes it possible to measure the expression of thousands of genes simultaneously. Large-scale gene expression data contain a huge number of genes but only a few samples. Therefore, algorithms that can accurately detect the genes associated with a specific disease among a huge number of unrelated genes can help experts detect and treat the disease early. A two-phase search algorithm is proposed in this paper to discover biomarkers in an RNA-seq gene expression dataset for prostate cancer diagnosis. After statistical noise removal from the original large-scale dataset, a multi-objective optimization process is used to select the best non-dominated subset of genes, simultaneously maximizing the classification accuracy and minimizing the number of genes. Finally, the proposed cache-based modification of the sequential forward floating selection (CMSFFS) algorithm is applied to the selected subset of genes to discover the most discriminant genes. The results show that the proposed algorithm achieves a classification accuracy, sensitivity and specificity of 100% on the large-scale RNA-seq prostate cancer dataset by selecting only three biomarkers.

Keywords: RNA-seq, large-scale prostate cancer data, two-phase search algorithm, multi-objective-based optimization, CMSFFS

1. Introduction

The prostate is a small gland in the male reproductive system.
Prostate cancer, also known as carcinoma of the prostate, occurs when some of the cells of the prostate reproduce much faster than normal. RNA-sequencing data are known as large-scale data. Scientists use large-scale data to examine the differences between normal and abnormal cells in the human body. Because of the high cost of genetic tests, the number of samples in large-scale data is very low compared to the number of extracted genes. This is the main reason for the poor performance of many reported gene selection techniques. It is of great importance to develop an algorithm for the accurate detection of the genes associated with prostate cancer among a huge number of unrelated genes.

Corresponding author. Akbar Rahideh, Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran. E-mail: rahide@sutech.ac.ir.
1064-1246/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved

A number of studies on gene selection from large-scale prostate cancer data have been reported, as follows. Chiang and Ho [1] combined a rough-based feature selection method with a radial basis function (RBF) neural network for the classification of gene expression data. This method can find the relevant features without requiring the number of clusters to be known a priori and can identify centers that approximate the correct ones. In that work, the authors introduce a prediction scheme that combines rough-based feature selection with the RBF neural network. A hybrid gene selection method (IG-SVM) has been proposed in [2] to select informative genes for cancer classification. IG is a filter method that can eliminate irrelevant features in high-dimensional