Estimating Pairwise Statistical Significance of Protein Local Alignments Using a Clustering-Classification Approach Based on Amino Acid Composition

Ankit Agrawal 1, Arka Ghosh 2, and Xiaoqiu Huang 1

1 Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA
{ankitag,xqhuang}@iastate.edu
2 Department of Statistics, Iowa State University, 303 Snedecor Hall, Ames, IA 50011-1210, USA
apghosh@iastate.edu

Abstract. A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. No statistical theory is currently available, however, for the gapped case or for alignments using multiple scoring matrices, although their score distribution is known to closely follow an extreme value distribution and the corresponding parameters can be estimated by simulation. Ideal estimation would require simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including schemes that use multiple parameter sets. The resulting set of K and λ for different cluster pairs has large variability even for the same scoring scheme, underscoring the heavy dependence of K and λ on the amino acid composition. The proposed approach is an attempt to separate the influence of amino acid composition in the estimation of statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimating statistical parameters also indicate that the methods used in this work estimate the statistical significance with good accuracy.

Keywords: Clustering, Classification, Pairwise local alignment, Statistical significance.
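The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each sequence is assigned to the nearest of a set of precomputed composition centroids (Euclidean distance is an assumption), that a table of (K, λ) values estimated by simulation is available per unordered cluster pair, and that significance is then computed from the standard extreme value form P(S ≥ x) ≈ 1 − exp(−K m n e^(−λx)). All function names, the `centroids` list, and the `params` table are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_cluster(vec, centroids):
    """Classify a composition vector to the closest centroid (assumed Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def pairwise_p_value(score, m, n, K, lam):
    """P(S >= score) under an extreme value distribution with parameters K, lam."""
    # -expm1(-t) equals 1 - exp(-t) and stays accurate when t is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def estimate_p_value(seq1, seq2, score, centroids, params):
    """Look up (K, lam) for the cluster pair of two sequences and return a P-value.

    `params` is a hypothetical table keyed by (i, j) with i <= j; treating the
    cluster pair as unordered is an assumption of this sketch.
    """
    c1 = nearest_cluster(composition(seq1), centroids)
    c2 = nearest_cluster(composition(seq2), centroids)
    K, lam = params[(min(c1, c2), max(c1, c2))]
    return pairwise_p_value(score, len(seq1), len(seq2), K, lam)
```

In such a sketch the expensive simulation is done once per cluster pair rather than once per sequence pair; classifying a new sequence costs only one composition vector and a handful of distance computations.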
1 Introduction

Sequence alignment is extremely useful in the analysis of DNA and protein sequences [1]. It is the basic step in making various high-level inferences about DNA and protein sequences, such as detecting homology, finding protein function and structure, and deciphering evolutionary relationships.

I. Măndoiu, R. Sunderraman, and A. Zelikovsky (Eds.): ISBRA 2008, LNBI 4983, pp. 62–73, 2008. © Springer-Verlag Berlin Heidelberg 2008