OBSERVATIONS ON USING PROBABILISTIC C-MEANS FOR SOLVING A TYPICAL BIOINFORMATICS PROBLEM

J. Mohammadzadeh, A. Ghazinezhad£, A. Rasooli Valaghozi&, A. Nadi, E. Asgarian, V. Salmani, A. Najafi-Ardabili, M-H. Moeinzadeh

School of Mathematics, Statistics and Computer Science, University of Tehran, Iran
Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
& Department of Applied Mathematics, Iran University of Science and Technology, Iran
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA

Abstract

Recently, there has been great interest in bioinformatics among researchers from various disciplines such as computer science, mathematics, statistics and artificial intelligence. Bioinformatics mainly deals with solving biological problems at the molecular level. One of the classic problems of bioinformatics that has gained a lot of attention lately is haplotyping, the goal of which is to categorize SNP fragments into two clusters and deduce a haplotype for each. Since the problem has been proved to be NP-hard, several computational and heuristic methods have addressed it, seeking feasible answers. In this work it is shown that using Probabilistic C-Means (PCM) to solve the haplotyping problem on the DALY dataset yields better results compared to currently available methods.

1. Introduction

Having the complete human genome makes the investigation and analysis of complicated diseases simpler. A haplotype consists of a string of nucleotides called alleles, which come in two types, ‘A’ and ‘B’. Every human being has two haplotypes, namely maternal and paternal. One of the problems facing researchers is to infer haplotypes theoretically from SNP fragments, because obtaining them experimentally is costly and difficult to carry out under laboratory conditions. So far, several computational models have been used to solve the problem.
In the current work, the Minimum Error Correction (MEC) model is used, which seeks to minimize the difference between the original haplotypes and the ones inferred through computation.

A Genetic Algorithm (GA) has been used to solve the aforementioned problem: binary strings of length m served as chromosomes that divided the fragments into two clusters, and one haplotype was then inferred for each cluster [8]. AGO, a greedy algorithm with three different versions, has also been used to solve the problem [9]. This method iteratively combined the two nearest fragments and substituted the result for them. The number of fragments therefore decreased until only two remained, and these two fragments were taken as the final haplotypes.

A two-layer competitive unsupervised neural network (UWNN) has been designed to solve the MEC and MEC/GI models [10]. Fragments are fed to the neural network consecutively. The first layer was made up of ‘n’ units (the SNP layer), each node of which is related to one SNP site, while the second layer was composed of two units. In each epoch, one of the second-layer nodes, which competed based on the similarity between each fragment and the semi-haplotypes, was marked as the winner, and the weights of the winner node were updated. After an appropriate number of epochs, two strings called semi-haplotypes (haplotypes represented with decimal numbers) were reconstructed from the weights of the neural network.

2. Problem definitions

To formulate the haplotype reconstruction problem, a matrix of SNP fragments, $M_{n \times m}$ (n: number of fragments, m: length of each fragment), is taken as input. Each row of the matrix is one SNP fragment. For convenience, "fragment" is used instead of "SNP fragment" in the rest of the paper.
$$F = \{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_n\} \quad \text{(set of all fragments)}$$

$$M_{n \times m} = [\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_n] \quad \text{(fragments matrix)}$$

Each entry of a fragment in the matrix is ‘A’, ‘B’ or ‘-’, the latter corresponding to missing or skipped data.

Second UKSIM European Symposium on Computer Modeling and Simulation. 978-0-7695-3325-4/08 $25.00 © 2008 IEEE. DOI 10.1109/EMS.2008.96
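To make the fragment matrix and the MEC idea concrete, the following minimal Python sketch stores a toy matrix as a list of strings over ‘A’, ‘B’ and ‘-’, and scores a given two-cluster assignment. It is only an illustration under stated assumptions: the function names (`consensus`, `mismatches`, `mec_score`), the toy data, the majority-vote consensus and the tie-breaking rule are all hypothetical and not taken from the paper. The score counts fragment entries that disagree with their cluster's inferred haplotype, which is one common reading of MEC. It is not the PCM method itself.

```python
from collections import Counter

GAP = "-"  # missing or skipped entry


def consensus(fragments):
    """Majority-vote haplotype for one cluster (assumed rule, not from the paper)."""
    if not fragments:
        return ""
    hap = []
    for j in range(len(fragments[0])):
        counts = Counter(f[j] for f in fragments if f[j] != GAP)
        if not counts:
            hap.append(GAP)  # no fragment covers this SNP site
        else:
            # Ties are broken toward 'A'; this choice is arbitrary.
            hap.append("A" if counts["A"] >= counts["B"] else "B")
    return "".join(hap)


def mismatches(fragment, haplotype):
    """Count positions where both strings are defined and disagree."""
    return sum(
        1
        for x, y in zip(fragment, haplotype)
        if x != GAP and y != GAP and x != y
    )


def mec_score(matrix, labels):
    """Return the two consensus haplotypes and the total number of mismatches."""
    clusters = ([], [])
    for frag, lab in zip(matrix, labels):
        clusters[lab].append(frag)
    haps = [consensus(c) for c in clusters]
    errors = sum(mismatches(f, haps[lab]) for f, lab in zip(matrix, labels))
    return haps, errors


# Toy fragment matrix: n = 5 fragments (rows), m = 4 SNP sites (columns).
M = ["AAB-", "A-BB", "BBAA", "-BAA", "BBAB"]
labels = [0, 0, 1, 1, 1]  # hypothetical cluster assignment
haps, errors = mec_score(M, labels)
print(haps, errors)
```

In this sketch a clustering method only needs to supply `labels`; a lower `errors` value indicates an assignment whose fragments agree better with the two inferred haplotypes.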