S. Chaudhury et al. (Eds.): PReMI 2009, LNCS 5909, pp. 184–192, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Multiple Sequence Alignment Based Upon Statistical
Approach of Curve Fitting
Vineet Jha, Mohit Mazumder*, Hrishikesh Bhuyan, Ashwani Jha,
and Abhinav Nagar
InSilico Biosolution, 103B, North Guwahati College Road, Abhoypur, Near IIT-Guwahati,
P.O – College Nagar, North Guwahati – 781031, Assam, India
mazumder.mohit@gmail.com
Abstract. The main objective of our work is to align multiple sequences
using a statistical approach rather than a heuristic one. We propose a novel
idea for aligning multiple sequences in which DNA sequences are treated as
lines rather than strings, with each character representing a point on the
line. The sequences are positioned so that the overlap between them is
maximal, which yields the maximum number of matching characters; these
matches are treated as the seeds of the alignment. The proposed algorithm
first finds the seeds in the sequences to be aligned and then grows the
alignment by a statistical approach of curve fitting using the standard
deviation.
Keywords: Multiple Sequence Alignment, Sequence Alignment, Word
Method, Statistically Optimized Algorithm, Comparative Genome Analysis,
Cross Referencing, Evolutionary Relationship.
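The seed-finding idea stated in the abstract (position two sequences so that the number of matching characters is maximal, then treat the matches as seeds) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names best_overlap and seed_positions are hypothetical, the search is a brute-force scan over all relative shifts of two sequences, and the curve-fitting growth step is not shown.

```python
def best_overlap(a: str, b: str) -> tuple[int, int]:
    """Return (offset, matches) for the shift of b along a with most matches.

    Hypothetical helper: offset is the index in a at which b[0] is placed;
    negative offsets mean b starts before a.
    """
    best_offset, best_matches = 0, -1
    # Try every shift at which the two sequences share at least one column.
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for j, ch in enumerate(b)
            if 0 <= offset + j < len(a) and a[offset + j] == ch
        )
        # Keep the first shift that strictly improves the match count.
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


def seed_positions(a: str, b: str, offset: int) -> list[tuple[int, int]]:
    """List (index in a, index in b) pairs that match at the given shift."""
    return [
        (offset + j, j)
        for j, ch in enumerate(b)
        if 0 <= offset + j < len(a) and a[offset + j] == ch
    ]


if __name__ == "__main__":
    a, b = "ACGTACGT", "TACG"
    offset, matches = best_overlap(a, b)
    print(offset, matches, seed_positions(a, b, offset))
```

The scan costs time proportional to the product of the two sequence lengths, which is acceptable for illustration; the matching positions it returns are the candidate seeds from which an alignment would then be grown.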
1 Introduction
Multiple sequence alignment is a crucial prerequisite for biological sequence data
analysis. It is a way of arranging DNA, RNA, or protein sequences to identify regions
of similarity that may be a consequence of functional, structural, or evolutionary
relationships between the sequences. A large number of multiple-alignment programs
have been developed during the last twenty years. There are three main considerations
in choosing a program: biological accuracy, execution time, and memory usage.
Biological accuracy is generally the most important of these. Some of the prominent
and accurate programs according to most benchmarks are CLUSTAL W [1], DIALIGN [2],
T-COFFEE [3], MAFFT, MUSCLE, and PROBCONS. An overview of these tools and other
established methods is given in [4].
T-COFFEE is a prototypical consistency-based method that is still considered
one of the most accurate programs available. MAFFT and MUSCLE have a similar
design, building on work done by Gotoh in the 1990s that culminated in the PRRN
* Corresponding author.