S. Chaudhury et al. (Eds.): PReMI 2009, LNCS 5909, pp. 184–192, 2009. © Springer-Verlag Berlin Heidelberg 2009

Multiple Sequence Alignment Based Upon Statistical Approach of Curve Fitting

Vineet Jha, Mohit Mazumder*, Hrishikesh Bhuyan, Ashwani Jha, and Abhinav Nagar

InSilico Biosolution, 103B, North Guwahati College Road, Abhoypur, Near IIT-Guwahati, P.O – College Nagar, North Guwahati – 781031, Assam, India
mazumder.mohit@gmail.com

Abstract. The main objective of our work is to align multiple sequences together on the basis of a statistical approach in lieu of a heuristic approach. We propose a novel idea for aligning multiple sequences in which DNA sequences are treated as lines rather than strings, with each character representing a point on the line. The DNA sequences are positioned so that the maximum overlap occurs between them, which gives the maximum number of matching characters; these matches are treated as the seeds of the alignment. The proposed algorithm first finds the seeds in the sequences being aligned and then grows the alignment on the basis of a statistical approach of curve fitting using standard deviation.

Keywords: Multiple Sequence Alignment, Sequence Alignment, Word Method, Statistically Optimized Algorithm, Comparative Genome Analysis, Cross Referencing, Evolutionary Relationship.

1 Introduction

Multiple sequence alignment is a crucial prerequisite for biological sequence data analysis. It is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. A large number of multiple alignment programs have been developed during the last twenty years. There are three main considerations in choosing a program: biological accuracy, execution time, and memory usage. Of these, biological accuracy is generally the most important.
Some of the prominent and accurate programs according to most benchmarks are CLUSTAL W [1], DIALIGN [2], T-COFFEE [3], MAFFT, MUSCLE, and PROBCONS. An overview of these tools and other established methods is given in [4]. T-COFFEE is a prototypical consistency-based method that is still considered one of the most accurate programs available. MAFFT and MUSCLE have a similar design, building on work done by Gotoh in the 1990s that culminated in the PRRN

* Corresponding author.