Aligning Sequences with Non-Affine Gap Penalty: PLAINS Algorithm, a Practical Implementation, and its Biological Applications in Comparative Genomics

Ofer Gill 1, 4, Yi Zhou 3 and Bud Mishra 1, 2

1 Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY, USA 10012.
2 Department of Cell Biology, NYU School of Medicine, 550 First Avenue, New York, NY 10016.
3 Department of Biology, New York University, 100 Washington Square East, New York, NY 10003.
4 Corresponding Author: gill@cs.nyu.edu; Phone: 212-998-3351

ABSTRACT

In this paper, we consider PLAINS, an algorithm that provides efficient alignment over DNA sequences using piecewise-linear gap penalties that closely approximate more general and meaningful gap functions. The innovations of PLAINS are fourfold. First, when the number of parts of the piecewise-linear gap function is fixed, PLAINS uses linear space in the worst case and obtains an alignment that is provably correct under its memory constraints; its asymptotic complexity is thus similar to that of the currently best implementations of Smith-Waterman. Second, PLAINS scores alignments based on important segment pairs and optimizes gap parameters based on interspecies alignments, and thus identifies more significant correlations than other similar algorithms. Third, we describe a practical implementation of PLAINS in the Valis multi-scripting environment with powerful and intuitive visualization interfaces, which allows users to view the alignments with a natural multiple-scale color grid scheme. Fourth, and most importantly, we have evaluated the biological utility of PLAINS using extensive lab results; we report the result of comparing a human sequence to a fugu sequence, where PLAINS was capable of finding more orthologous exon correlations than similar alignment tools.
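To make the notion of a piecewise-linear gap penalty concrete, the following is a minimal Python sketch of how such a function could be evaluated for a single gap. It is an illustration only and is not taken from the PLAINS implementation: the function name `piecewise_linear_gap`, its parameters, and the example breakpoints and slopes are all hypothetical. The choice of decreasing slopes, so that long gaps cost less per base than short ones, is likewise an assumption made for the example.

```python
def piecewise_linear_gap(length, breakpoints, slopes, opening):
    """Penalty for one gap of `length` bases (hypothetical helper).

    breakpoints: increasing gap lengths at which the per-base cost changes
    slopes:      per-base cost on each piece; len(slopes) == len(breakpoints) + 1
    opening:     fixed cost charged once per gap
    """
    if length <= 0:
        return 0.0
    penalty = opening
    start = 0  # gap length at which the current piece begins
    for bp, slope in zip(breakpoints, slopes):
        if length <= bp:
            # The gap ends inside this piece.
            return penalty + slope * (length - start)
        # The gap extends past this piece: charge the whole piece.
        penalty += slope * (bp - start)
        start = bp
    # The gap ends in the last (unbounded) piece.
    return penalty + slopes[-1] * (length - start)


# Illustrative values only: three pieces with decreasing per-base cost.
example_breakpoints = [10, 100]
example_slopes = [1.0, 0.1, 0.01]
example_opening = 5.0

for gap_length in (1, 10, 50, 1000):
    cost = piecewise_linear_gap(
        gap_length, example_breakpoints, example_slopes, example_opening
    )
    print(gap_length, cost)
```

With an empty `breakpoints` list and a single slope, the same function reduces to the familiar affine penalty (opening cost plus a constant per-base extension cost), which is the special case that a piecewise-linear function generalizes.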
1 INTRODUCTION

To a rough approximation, the DNA sequence alignment problem differs only marginally from the protein sequence alignment problem. (For instance, at a superficial level, one may note that DNA alignment is over an alphabet of 4 letters, whereas protein alignment is over an alphabet of 20 letters.) However, two key differences are that (1) three DNA base pairs code for each amino acid, and (2) genes in DNA sequences that ultimately get transcribed and translated into proteins can be separated by intergenic regions of a few thousand base pairs that are not expressed and are perhaps subject to strikingly different (or no) selection constraints. Thus, these intergenic regions typically vary to a greater extent from one species to another. Therefore, we may expect the gap lengths in DNA alignments to be larger, more variable, and to have species-specific distributions. Moreover, the distributions characterizing the gap lengths may not be memoryless (i.e., exponential distributions); there have been suggestions that power-law distributions may be more appropriate. The evolutionary processes governing the genomes of species, and the log-likelihood of certain indel gaps occurring when comparing one species against another suggest that a logarithmic