Indonesian Journal of Electrical Engineering and Computer Science
Vol. 11, No. 3, September 2018, pp. 1223~1227
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v11.i3.pp1223-1227
Journal homepage: http://iaescore.com/journals/index.php/ijeecs

An Effective Pre-Processing Phase for Gene Expression Classification

Choon Sen Seah 1, Shahreen Kasim 2, Mohd Farhan Md Fudzee 3, Mohd Saberi Mohamad 4, Rd Rohmat Saedudin 5, Rohayanti Hassan 6, Mohd Arfian Ismail 7, Rodziah Atan 8

1,2,3 Soft Computing and Data Mining Centre, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia
4 Faculty of Creative Technology and Heritage, Universiti Malaysia Kelantan, Karung Berkunci 01, 16300, Bachok, Kelantan, Malaysia
5 School of Industrial Engineering, Telkom University, 40257 Bandung, West Java, Indonesia
6 Laboratory of Biodiversity and Bioinformatics, Universiti Teknologi Malaysia, 81300 Skudai, Johor, Malaysia
7 Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Pahang, Malaysia
8 Department of Software Engineering & Information System, Faculty of Computer Science and Information Technology, University Putra Malaysia (UPM), 43400 Selangor, Serdang, Malaysia

Article Info
Article history: Received Apr 5, 2018; Revised Jun 6, 2018; Accepted Jun 20, 2018

ABSTRACT
A raw dataset prepared by researchers comes with a lot of information. Whether that information is useful depends entirely on the requirements and purpose of the study. In machine learning, data pre-processing is the very first stage, and it must ensure that the dataset is suitable for the task at hand. In significant directed random walk (sDRW), the data pre-processing stage has three steps. First, unwanted attributes and missing values are removed and the data are properly arranged; second, the expression values are normalized; finally, a filtering method is applied.
The first two steps are carried out with Bioconductor packages, while the last step is performed within sDRW.

Keywords: Bioconductor, Data pre-processing, Gene expression dataset, Significant directed random walk

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Choon Sen Seah, Soft Computing and Data Mining Centre, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia. Email: seanseah0702@gmail.com

1. INTRODUCTION
Microarray technology is a biotechnology used to study the expression of genes in a cell [1]. It places gene sequences on a glass slide called a gene chip. The gene chip is designed to display sequences of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Complementary base pairing between the sample and the gene sequences on the chip produces different colours depending on the expression level of each gene. Microarray technology allows researchers to analyse thousands of gene expression profiles simultaneously [2-5]. A dataset produced by microarray technology is known as a gene expression dataset [2].

The volume of biomedical research, especially cancer research, has increased. However, the high dimensionality of these datasets also affects research results. Because a microarray dataset is high-dimensional, classification and computation become more complex when studying gene expression characteristics [6]. In addition, microarray datasets contain many improper attributes, and missing values may occur after the initial collection of the dataset. Both can reduce the accuracy of the classification algorithm.
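The three pre-processing steps named in the abstract (removal of missing values, normalization, filtering) can be sketched as follows. This is a minimal illustration in Python using only the standard library; it is not the Bioconductor workflow or the sDRW implementation. The per-gene z-score normalization, the Welch t-statistic filter, and the threshold of 2.0 are assumptions, since this section does not specify which normalization or filtering method is used. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def drop_incomplete_genes(expr):
    """Step 1 (sketch): keep only genes whose values are all present and finite."""
    return {
        gene: values
        for gene, values in expr.items()
        if all(v is not None and math.isfinite(v) for v in values)
    }


def zscore_per_gene(expr):
    """Step 2 (assumed method): z-score each gene across its samples."""
    normalized = {}
    for gene, values in expr.items():
        mu, sd = mean(values), stdev(values)
        # A constant gene has zero spread; map it to all zeros.
        normalized[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return normalized


def welch_t(a, b):
    """Welch t-statistic between two groups of values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def filter_genes(expr, labels, threshold):
    """Step 3 (assumed criterion): keep genes with |t| >= threshold."""
    kept = {}
    for gene, values in expr.items():
        case = [v for v, y in zip(values, labels) if y == 1]
        control = [v for v, y in zip(values, labels) if y == 0]
        if abs(welch_t(case, control)) >= threshold:
            kept[gene] = values
    return kept


def preprocess(expr, labels, threshold=2.0):
    """Run the three steps in order on a gene -> expression-values mapping."""
    cleaned = drop_incomplete_genes(expr)
    normalized = zscore_per_gene(cleaned)
    return filter_genes(normalized, labels, threshold)


# Hypothetical toy data: 4 genes, 3 control samples (0) and 3 case samples (1).
raw = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "GENE_C": [3.0, None, 3.1, 3.2, 2.9, 3.0],  # has a missing value
    "GENE_D": [4.0, 4.1, 3.9, 1.0, 1.2, 0.8],
}
labels = [0, 0, 0, 1, 1, 1]

selected = preprocess(raw, labels)
print(sorted(selected))
```

In this toy example, GENE_C is removed in the first step because of its missing value, and GENE_B is removed in the third step because its expression does not differ between the two classes. A real dataset would also need a decision on whether to impute missing values instead of dropping the gene, which this sketch does not address.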