Indonesian Journal of Electrical Engineering and Computer Science
Vol. 11, No. 3, September 2018, pp. 1223~1227
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v11.i3.pp1223-1227
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
An Effective Pre-Processing Phase for Gene Expression Classification
Choon Sen Seah¹, Shahreen Kasim², Mohd Farhan Md Fudzee³, Mohd Saberi Mohamad⁴,
Rd Rohmat Saedudin⁵, Rohayanti Hassan⁶, Mohd Arfian Ismail⁷, Rodziah Atan⁸
¹,²,³ Soft Computing and Data Mining Centre, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia
⁴ Faculty of Creative Technology and Heritage, Universiti Malaysia Kelantan, Karung Berkunci 01, 16300, Bachok, Kelantan, Malaysia
⁵ School of Industrial Engineering, Telkom University, 40257 Bandung, West Java, Indonesia
⁶ Laboratory of Biodiversity and Bioinformatics, Universiti Teknologi Malaysia, 81300 Skudai, Johor, Malaysia
⁷ Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Pahang, Malaysia
⁸ Department of Software Engineering & Information System, Faculty of Computer Science and Information Technology, University Putra Malaysia (UPM), 43400 Selangor, Serdang, Malaysia
Article Info
Article history:
Received Apr 5, 2018
Revised Jun 6, 2018
Accepted Jun 20, 2018
ABSTRACT
A raw dataset prepared by researchers contains a large amount of information.
Whether that information is useful depends entirely on the requirements and
purpose of the study. In machine learning, data pre-processing is the initial
stage, and it must ensure that the dataset is suitable for the task at hand.
In significant directed random walk (sDRW), the data pre-processing stage has
three steps. First, unwanted attributes and missing values are removed and the
data are properly arranged; second, the expression values are normalized; and
lastly, a filtering method is applied. The first two steps are completed with
a Bioconductor package, while the last step is carried out within sDRW.
Keywords:
Bioconductor
Data pre-processing
Gene expression dataset
Significant directed random walk
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Choon Sen Seah,
Soft Computing and Data Mining Centre,
Faculty of Computer Sciences and Information Technology,
Universiti Tun Hussein Onn Malaysia.
Email: seanseah0702@gmail.com
1. INTRODUCTION
Microarray technology is a branch of biotechnology that studies the expression of genes in the
cell [1]. Gene sequences are placed on a glass slide called a gene chip, which is designed to display
sequences of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Complementary base pairing
between the sample and the gene sequences on the chip produces different colours according to the
expression level of each gene. Microarray technology allows researchers to analyse thousands of gene
expression profiles simultaneously [2-5]. The datasets it produces are known as gene expression
datasets [2]. The amount of biomedical research, especially cancer research, has increased. However,
the high dimensionality of these datasets also affects research results. Because a microarray dataset
is high-dimensional, classification and computation become more complex when studying gene
expression characteristics [6].
In addition, microarray datasets often contain improper attributes, and missing values may occur
after the initial collection of the dataset. Both can affect the accuracy of the classification algorithm.
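As a rough illustration of the kind of clean-up this implies, the following Python sketch drops genes that have missing values and then standardizes each remaining gene across samples. It is not the Bioconductor workflow or the sDRW implementation: the function names, the toy data, and the choice of z-score scaling as the normalization are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def drop_incomplete_genes(matrix):
    """Keep only genes (rows) whose expression values are all present."""
    return {
        gene: values
        for gene, values in matrix.items()
        if all(v is not None and not math.isnan(v) for v in values)
    }


def zscore_rows(matrix):
    """Scale each gene to zero mean and unit variance across samples.

    Assumption: z-score scaling stands in for whatever normalization
    the actual pipeline applies.
    """
    scaled = {}
    for gene, values in matrix.items():
        mu, sigma = mean(values), pstdev(values)
        # A constant gene has zero spread; map it to zeros rather than divide by 0.
        scaled[gene] = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]
    return scaled


# Hypothetical toy data: gene name -> expression values over four samples.
raw = {
    "geneA": [2.0, 4.0, 6.0, 8.0],
    "geneB": [1.0, None, 3.0, 2.0],  # contains a missing value, so it is dropped
    "geneC": [5.0, 5.0, 5.0, 5.0],   # constant across samples
}

clean = zscore_rows(drop_incomplete_genes(raw))
```

Dropping whole genes is the simplest policy for missing values; imputing them instead would keep more genes at the cost of introducing estimated values.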