International Journal of Engineering & Technology IJET-IJENS Vol:14 No:01
140101-7575-IJET-IJENS © February 2014 IJENS

Analysis on Clustering Method for HMM-Based Exon Controller of DNA Plasmodium falciparum for Performance Improvement

Alfred Pakpahan 1, Suhartati Agoes 2, Binti Solihah 3
1 Department of Biology, Faculty of Dentistry, Trisakti University
2 Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University
3 Informatic Technology Department, Faculty of Industrial Technology, Trisakti University
Trisakti University, Jalan Kyai Tapa Grogol Jakarta 11440, Indonesia
Email: 1 alfred@trisakti.ac.id, 2 sagoes@trisakti.ac.id, 3 binti76@yahoo.com

Abstract-- The performance of a Hidden Markov Model (HMM)-based exon controller for Plasmodium falciparum deoxyribonucleic acid (DNA) can be improved by applying clustering methods to the data used for HMM training and testing. Coding sequence (CDS) data of Plasmodium falciparum DNA are used as input during training to build the model; the resulting model is then tested on sequence data, and its level of recognition of data with a given number of exons is calculated. Different numbers of states can be implemented in the HMM structure to find the configuration at which the model's performance measure, the Correlation Coefficient (CC), is optimal. This research also examines the similarity of the protein products predicted by the HMM models using the Open Reading Frame (ORF), and identifies patterns of insertion and deletion in the products associated with the predicted exon lengths. The simulation results indicate that increasing the number of states in the model does not increase the model's performance linearly, whereas applying clustering in HMM training and testing increases the CC value with a relatively short simulation processing time.
Index Term-- HMM, Plasmodium falciparum DNA, CDS, clustering, CC

1. INTRODUCTION
The objective of this study is to control the exons of deoxyribonucleic acid (DNA) in the coding sequence (CDS) so that the protein produced after transcription and translation remains unchanged, that is, so that there is no indication of alterations to the protein. The exon controlling process is similar to gene finding. As mentioned in [1,2], there are two classes of gene prediction methods: sequence similarity search and ab initio gene finding (gene structure and signal based search). The limitation of the first approach, as mentioned in [1], is that only about half of the genes being discovered have significant homology to genes in the database. For the ab initio approach, several algorithms have been developed, such as dynamic programming, Neural Network, Markov Model, and Hidden Markov Model. The most successful of these is the Hidden Markov Model [1, 2].

One method that can be used to control DNA exons is the Hidden Markov Model (HMM), whose parameters are the number of states, the state transition values, and the state emission values; the Baum-Welch and Viterbi algorithms are used for the training and testing processes. In this study, an HMM is implemented to control exons through simulation trials in the MATLAB programming environment, and one measure of the developed model's performance is the Correlation Coefficient (CC). The accuracy of the model in controlling exons is indicated by the CC value. The approaches that have been used to increase the CC value include adding HMM states up to a certain number [3,4,5] and classifying the training data by the number of exons in the CDS [6]. Increasing the CC value by adding states costs training time, with a logarithmic tendency, and makes the search for a suitable state composition of the model difficult.
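The CC used above as the accuracy measure is not defined in this section. As a rough illustration only, the following Python sketch assumes the nucleotide-level correlation coefficient commonly used to evaluate gene finders (equivalent to the Matthews correlation coefficient), where each base is labeled 1 if it lies in an exon and 0 otherwise. The function name `correlation_coefficient` and the 0/1 label encoding are hypothetical choices for this example, and the code is not the authors' MATLAB implementation.

```python
import math


def correlation_coefficient(true_labels, pred_labels):
    """Nucleotide-level CC between an annotated and a predicted exon labeling.

    Assumption: each sequence holds one label per base, 1 for exon (coding)
    and 0 for non-coding. Returns a value in [-1, 1].
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t and p:
            tp += 1          # exon base predicted as exon
        elif not t and not p:
            tn += 1          # non-coding base predicted as non-coding
        elif p:
            fp += 1          # non-coding base predicted as exon
        else:
            fn += 1          # exon base predicted as non-coding

    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        # CC is undefined when one class is absent; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    annotated = [0, 0, 1, 1, 1, 0]
    predicted = [0, 1, 1, 1, 0, 0]
    print(correlation_coefficient(annotated, predicted))
```

Under this assumed definition, a prediction identical to the annotation gives a CC of 1, and a prediction that inverts every label gives -1.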
Developing the state models with clustering does increase the CC value, but this approach is constrained by the limited training data. It is therefore necessary to identify other ways to optimize the model.

The goals of this study are to identify the relationship between the CC value and the similarity of the predicted protein product to the original product, to identify insertions and deletions in the model's predictions relative to the original CDS, and then to apply Fuzzy C-Means clustering to the training data in order to improve the performance of the existing model. The clustering result is used to obtain models that are specific to the characteristics of the data. The trial results show an increase in the CC value compared with previous test results for the same model structure.

2. HIDDEN MARKOV MODEL TO CONTROL EXON
The Hidden Markov Model (HMM) is a stochastic model in which a signal (here, the DNA signal) is modeled as a Markov chain of states and a corresponding finite observation process modeled on