OBSERVATIONS ON USING PROBABILISTIC C-MEANS FOR SOLVING A
TYPICAL BIOINFORMATICS PROBLEM
J. Mohammadzadeh†, A. Ghazinezhad£, A. Rasooli Valaghozi&, A. Nadi, E. Asgarian, V. Salmani, A. Najafi-Ardabili, M-H. Moeinzadeh†
† School of Mathematics, Statistics and Computer Science, University of Tehran, Iran
Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
& Department of Applied Mathematics, Iran University of Science and Technology, Iran
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Abstract
Recently, there has been great interest in
bioinformatics among researchers from various
disciplines such as computer science,
mathematics, statistics and artificial intelligence.
Bioinformatics mainly deals with solving
biological problems at the molecular level. One of
the classic problems of bioinformatics that has
gained a lot of attention lately is haplotyping, the
goal of which is to categorize SNP-fragments into
two clusters and deduce a haplotype for each.
Since the problem has been proved to be NP-hard,
several computational and heuristic methods
have addressed it in search of feasible
answers. In this work it is shown that using
probabilistic c-means (PCM) to solve the
haplotyping problem on the DALY dataset yields
better results than currently available methods.
1. Introduction
Having the complete human genome makes the
investigation and analysis of complicated
diseases simpler. A haplotype consists of a
string of nucleotides called alleles, which
come in two types, ‘A’ and ‘B’. Every human
being has two haplotypes, one maternal and one
paternal. One of the problems facing researchers
is to infer haplotypes computationally from
SNP-fragments, because determining them directly
is costly and difficult under laboratory
conditions. So far, several computational
models have been used to solve the problem. In
the current work, the Minimum Error Correction
(MEC) model is used, which seeks to minimize
the difference between the original haplotypes
and the ones inferred through computation.
A genetic algorithm (GA) has been used to solve
the aforementioned problem. Each chromosome was
a binary string of length m over a two-letter
alphabet, which divided the fragments into two
clusters. Then, for each cluster, one haplotype
was inferred [8].
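The decoding step of such a GA can be illustrated with a minimal sketch. It assumes one bit per fragment and a column-wise majority vote to infer each cluster's haplotype; neither detail is stated here, and the names `consensus` and `decode` are hypothetical.

```python
from collections import Counter

def consensus(fragments):
    """Column-wise majority vote over 'A'/'B', ignoring '-' gaps (assumed rule)."""
    if not fragments:
        return ""
    result = []
    for j in range(len(fragments[0])):
        counts = Counter(f[j] for f in fragments if f[j] != "-")
        if not counts:
            result.append("-")  # no fragment covers this site
        else:
            # Ties are broken in favour of 'A' (arbitrary choice).
            result.append("A" if counts["A"] >= counts["B"] else "B")
    return "".join(result)

def decode(chromosome, fragments):
    """Split fragments by the chromosome bits and infer one haplotype per cluster."""
    cluster0 = [f for bit, f in zip(chromosome, fragments) if bit == 0]
    cluster1 = [f for bit, f in zip(chromosome, fragments) if bit == 1]
    return consensus(cluster0), consensus(cluster1)

# Toy data: four fragments over four SNP sites.
fragments = ["AAB-", "A-BB", "BBA-", "-BAA"]
chromosome = [0, 0, 1, 1]
h0, h1 = decode(chromosome, fragments)
print(h0, h1)
```

A fitness function would then score each decoded pair of haplotypes; selection, crossover and mutation are omitted from this sketch.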
AGO, a greedy algorithm with three different
versions, has also been used to solve the
problem [9]. This method iteratively combined
the two nearest fragments and substituted the
result for them. The number of fragments
therefore decreased until only two remained.
These two fragments were taken as the final
haplotypes.
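The greedy reduction described above can be sketched as follows. This is not the AGO algorithm of [9] itself: the distance (mismatches on sites both fragments cover) and the merge rule (fill gaps, keep the first fragment's allele on conflict) are assumptions made for illustration, and all function names are hypothetical.

```python
from itertools import combinations

def distance(f, g):
    """Number of sites where both fragments are known and disagree."""
    return sum(1 for a, b in zip(f, g) if a != "-" and b != "-" and a != b)

def merge(f, g):
    """Combine two fragments: take f's allele, falling back to g's on a gap."""
    return "".join(a if a != "-" else b for a, b in zip(f, g))

def greedy_reduce(fragments):
    """Repeatedly replace the two nearest fragments by their merge until two remain."""
    pool = list(fragments)
    while len(pool) > 2:
        i, j = min(combinations(range(len(pool)), 2),
                   key=lambda p: distance(pool[p[0]], pool[p[1]]))
        merged = merge(pool[i], pool[j])
        pool = [f for k, f in enumerate(pool) if k not in (i, j)] + [merged]
    return pool

haplotypes = greedy_reduce(["AAB-", "A-BB", "BBA-", "-BAA"])
print(sorted(haplotypes))
```

Each iteration scans all pairs, so this naive version is quadratic per merge; it is meant only to show the shape of the procedure.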
A two-layer competitive unsupervised neural
network (UWNN) has been designed to solve the
MEC and MEC/GI models [10]. Fragments were
fed to the network consecutively. The first
layer (the SNP layer) was made up of ‘n’ units,
each related to one SNP site, while the second
layer was composed of two units. Two strings
called semi-haplotypes (haplotypes represented
with decimal numbers) were reconstructed after
an appropriate number of epochs. For each
fragment, the second-layer nodes competed based
on the similarity between the fragment and the
semi-haplotypes, and one of them was marked as
the winner. The weights of the winner node were
updated in each epoch. After an appropriate
number of epochs, the semi-haplotypes were read
from the weights of the network.
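A winner-take-all update of this kind might look like the sketch below. It is not the network of [10]: the numeric encoding (A = +1, B = -1, gap = 0), the dot-product similarity, the learning rate, and the initialization from the first fragment and its negation are all assumptions, and the function names are hypothetical.

```python
ENCODE = {"A": 1.0, "B": -1.0, "-": 0.0}  # assumed numeric encoding

def encode(fragment):
    return [ENCODE[c] for c in fragment]

def train(fragments, epochs=10, rate=0.5):
    """Competitive learning with two output nodes; returns their weight vectors."""
    data = [encode(f) for f in fragments]
    # Assumed initialization: a scaled copy of the first fragment and its negation.
    weights = [[0.1 * x for x in data[0]], [-0.1 * x for x in data[0]]]
    for _ in range(epochs):
        for x in data:
            # Similarity of the fragment to each semi-haplotype (dot product).
            scores = [sum(w * v for w, v in zip(node, x)) for node in weights]
            winner = 0 if scores[0] >= scores[1] else 1
            node = weights[winner]
            # Move only the winner toward the fragment, on the sites it covers.
            for j, v in enumerate(x):
                if v != 0.0:
                    node[j] += rate * (v - node[j])
    return weights

def read_haplotypes(weights):
    """Turn real-valued semi-haplotypes into 'A'/'B' strings by sign."""
    return ["".join("A" if w > 0 else "B" if w < 0 else "-" for w in node)
            for node in weights]

print(read_haplotypes(train(["AAB-", "A-BB", "BBA-", "-BAA"])))
```

The real-valued weight vectors play the role of the semi-haplotypes, and thresholding them by sign yields the final strings.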
2. Problem definitions
To formulate the haplotype reconstruction
problem, a matrix of SNP fragments, $M_{n \times m}$
(n: number of fragments, m: length of each
fragment), is taken as input. Each row of the
matrix is one SNP fragment. For convenience,
"fragment" is used instead of "SNP fragment" in
the rest of the paper.
$$
\begin{aligned}
F &= \{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_n\} && \text{(set of all fragments)} \\
M_{n \times m} &= [\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_n] && \text{(fragments matrix)}
\end{aligned}
$$
Each entry of the matrix is ‘A’, ‘B’ or ‘-’,
the last of which corresponds to missing or
skipped data.
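As an illustration of this representation, the matrix can be held as a list of n strings of length m. The sketch below also scores a candidate pair of haplotypes in the usual MEC sense, counting the entries that would have to be corrected if each fragment were assigned to its nearer haplotype. The toy data and the names `hamming` and `mec_score` are assumptions for the example only.

```python
GAP = "-"

def hamming(fragment, haplotype):
    """Mismatches between a fragment and a haplotype, skipping gap entries."""
    return sum(1 for a, b in zip(fragment, haplotype) if a != GAP and a != b)

def mec_score(matrix, h1, h2):
    """Sum over fragments of the distance to the nearer of the two haplotypes."""
    return sum(min(hamming(row, h1), hamming(row, h2)) for row in matrix)

# Toy fragment matrix: n = 4 fragments, m = 4 SNP sites.
M = ["AAB-", "A-BB", "BBA-", "-BAB"]
n, m = len(M), len(M[0])
assert all(c in "AB-" for row in M for c in row)

score = mec_score(M, "AABB", "BBAA")
print(n, m, score)
```

Here only the last fragment disagrees with its nearer haplotype, at a single site, so the score for this pair is 1.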
Second UKSIM European Symposium on Computer Modeling and Simulation
978-0-7695-3325-4/08 $25.00 © 2008 IEEE
DOI 10.1109/EMS.2008.96