International Journal of Engineering & Technology IJET-IJENS Vol:14 No:01 9
140101-7575-IJET-IJENS © February 2014 IJENS
Analysis on Clustering Method for HMM-Based
Exon Controller of DNA Plasmodium falciparum for
Performance Improvement
Alfred Pakpahan¹, Suhartati Agoes², Binti Solihah³
¹ Department of Biology, Faculty of Dentistry, Trisakti University
² Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University
³ Informatic Technology Department, Faculty of Industrial Technology, Trisakti University
Trisakti University, Jalan Kyai Tapa Grogol Jakarta 11440, Indonesia
Email: ¹ alfred@trisakti.ac.id, ² sagoes@trisakti.ac.id, ³ binti76@yahoo.com
Abstract-- The performance of a Hidden Markov Model (HMM)-based
exon controller for Plasmodium falciparum deoxyribonucleic acid
(DNA) can be improved by applying clustering methods to the data
used in HMM training and testing. Coding sequence (CDS) data of
Plasmodium falciparum DNA serve as input during training to
establish the model; the resulting model is then tested on
sequence data, and its level of familiarity with data containing a
certain number of exons is calculated. Different numbers of states
can be implemented in the HMM structure to find the optimal value
of the model's performance measure, the Correlation Coefficient
(CC). This research also identifies the similarity of the protein
products predicted by the HMM models using the Open Reading Frame
(ORF), and identifies patterns of insertion and deletion in the
products in relation to the predicted exon lengths. The simulation
results indicate that increasing the number of states in the model
does not increase the model's performance linearly, whereas
applying clustering to HMM training and testing increases the CC
value with a relatively short simulation processing time.
Index Term-- HMM, Plasmodium falciparum DNA, CDS,
clustering, CC
1. INTRODUCTION
The objective of this study is to control the exons of
deoxyribonucleic acid (DNA) in the coding sequence (CDS), so that
the protein produced after transcription and translation is
unchanged and there is no indication of changes to the protein.
The exon controlling process is similar to the gene finding
technique. As mentioned in [1,2], there are two classes of gene
prediction methods: sequence similarity search and ab initio gene
finding (gene structure and signal-based search). The limitation
of the first approach, as mentioned in [1], is that only half of
the genes being discovered have significant homology to genes in
databases. For the ab initio method, several algorithms have been
developed, such as dynamic programming, neural networks, Markov
models, and Hidden Markov Models. The most successful programs are
based on the Hidden Markov Model [1, 2].
One method that can be used to control DNA exons is the
Hidden Markov Model (HMM). Its parameters include the number of
states, the state transition values, and the state emission
values, and the algorithms used for the training and testing
processes are the Baum-Welch and Viterbi algorithms. In this
study, an HMM is implemented to control exons, with simulation
trials in the MATLAB programming environment, and one measure of
the performance of the developed model is the Correlation
Coefficient (CC). The accuracy of the model in controlling exons
is indicated by the CC value. Ways that have been used to increase
the CC value include adding HMM states up to a certain number of
states [3,4,5] and classifying the training data based on the
number of exons in the CDS [6]. Increasing the CC value by adding
states costs training time, the model's behavior tends to be
logarithmic, and the search for a state composition is difficult.
The development of clustering-based models, despite increasing the
CC value, is constrained by the limited training data. Therefore
it is necessary to identify other ways to optimize the model.
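As an illustration of the testing step and the performance measure described above, the following is a minimal sketch in Python (the study itself used MATLAB). It decodes a nucleotide sequence with the Viterbi algorithm on a two-state exon/intron HMM and scores the result with a nucleotide-level correlation coefficient. The two-state structure, all probability values, and the helper names are hypothetical and chosen only for illustration; the CC formula shown is the one commonly used in gene-finding evaluation and is assumed here, since this section does not define it. Baum-Welch training is not shown.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state path for obs (Viterbi in log space)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit[s][symbol]
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def correlation_coefficient(true_labels, pred_labels, positive="E"):
    """Nucleotide-level CC (assumed definition), exon label as the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p != positive:
            tn += 1
        elif p == positive:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # CC is undefined when a class is empty; return 0.0 by convention here
    return (tp * tn - fp * fn) / denom if denom else 0.0

def _logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy parameters (NOT taken from the paper): E = exon, I = intron
STATES = ("E", "I")
LOG_START = _logs({"E": 0.5, "I": 0.5})
LOG_TRANS = {"E": _logs({"E": 0.9, "I": 0.1}),
             "I": _logs({"E": 0.1, "I": 0.9})}
LOG_EMIT = {"E": _logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
            "I": _logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4})}

if __name__ == "__main__":
    predicted = viterbi("GCGCATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print("".join(predicted))
    print(correlation_coefficient("EEEEIIII", predicted))
```

Working in log space avoids numerical underflow on long sequences, which matters for CDS-length inputs. A model with more states would use the same functions with larger parameter tables.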
The goal of this study is to identify the relationship
between the CC value and the similarity of the predicted protein
product to the original product, to identify the insertions and
deletions in the model's predictions compared with the original
CDS, and then to apply Fuzzy C-Means clustering to the training
data to improve the performance of the existing model; the
clustering result is used to obtain a model that is specific to
the characteristics of the data. The trial results show an
increase in the CC value compared with previous test results for
the same model structure.
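The Fuzzy C-Means step mentioned above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in the study: the feature vectors (for example, a pair of numeric descriptors per CDS), the fuzzifier m = 2, the random initialization, and the function name are all hypothetical, since this section does not specify which features of the training data are clustered.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Cluster points into c fuzzy clusters; return (centers, memberships).

    memberships[i][j] is the degree to which point i belongs to cluster j;
    each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial membership matrix, rows normalized to sum to 1
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Update each center as the mean of the points weighted by u^m
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            wsum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / wsum
                            for d in range(dim)])
        # Update memberships from the distances to the new centers
        new_u = []
        for p in points:
            dist = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            new_u.append([
                1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                          for k in range(c))
                for j in range(c)
            ])
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(n) for j in range(c))
        u = new_u
        if change < tol:
            break
    return centers, u

if __name__ == "__main__":
    # Hypothetical two-dimensional features for six training sequences
    features = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
                (5.0, 5.1), (5.1, 5.0), (5.05, 5.05)]
    centers, u = fuzzy_c_means(features, c=2)
    hard = [max(range(2), key=lambda j: row[j]) for row in u]
    print(hard)
```

Taking the cluster with the highest membership for each sequence gives a hard partition of the training data, so that a separate model could be trained per cluster; the soft memberships remain available if a sequence should contribute to more than one model.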
2. HIDDEN MARKOV MODEL TO CONTROL EXON
A Hidden Markov Model (HMM) is a stochastic model in which a
signal (here, the DNA signal) is modeled as a finite-state
Markov chain, together with a corresponding observation process
modeled on