BIOINFORMATICS Vol. 18 no. 11 2002, Pages 1500–1507

Empirical determination of effective gap penalties for sequence comparison

J. T. Reese and W. R. Pearson

Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA

Received on January 4, 2002; revised on April 9, 2002; accepted on April 18, 2002

ABSTRACT

Motivation: No general theory guides the selection of gap penalties for local sequence alignment. We empirically determined the most effective gap penalties for protein sequence similarity searches with substitution matrices over a range of target evolutionary distances from 20 to 200 Point Accepted Mutations (PAMs).

Results: We embedded real and simulated homologs of protein sequences into a database and searched the database to determine the gap penalties that produced the best statistical significance for the distant homologs. The most effective penalty for the first residue in a gap (q + r) changes as a function of evolutionary distance, while the gap extension penalty for additional residues (r) does not. For these data, the optimal gap penalties for a given matrix scaled in 1/3 bit units (e.g. BLOSUM50, PAM200) are q = 25 − 0.1 × (target PAM distance), r = 5. Our results provide an empirical basis for selection of gap penalties and demonstrate how optimal gap penalties behave as a function of the target evolutionary distance of the substitution matrix. These gap penalties can improve expectation values by at least one order of magnitude when searching with short sequences, and improve the alignment of proteins containing short sequences repeated in tandem.

Contact: wrp@virginia.edu

INTRODUCTION

Sequence similarity searching and sequence alignment programs have become indispensable tools for biologists. These programs are routinely used to identify homologous sequences, to infer the structure and function of proteins and even to analyze patterns in entire genomes or proteomes.
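As a hedged illustration, the rule reported in the abstract can be written as a small Python sketch. It reads the rule as q = 25 − 0.1 × (target PAM distance) with r = 5, for matrices scaled in 1/3 bit units, and uses the abstract's convention that the first residue of a gap costs q + r and each additional residue costs r. The function names and the range check are assumptions made for this example; they are not taken from any sequence comparison program.

```python
from typing import Tuple

# Range of target evolutionary distances examined in the study (PAMs).
MIN_PAM, MAX_PAM = 20, 200


def gap_penalties(target_pam: float) -> Tuple[float, float]:
    """Return (q, r) for a matrix scaled in 1/3 bit units.

    Hypothetical helper: applies q = 25 - 0.1 * PAM, r = 5 and refuses to
    extrapolate outside the 20-200 PAM range covered by the study.
    """
    if not MIN_PAM <= target_pam <= MAX_PAM:
        raise ValueError("rule was only determined for 20-200 PAMs")
    q = 25 - target_pam / 10  # gap existence penalty, falls with distance
    r = 5.0                   # gap extension penalty, constant
    return q, r


def gap_cost(k: int, q: float, r: float) -> float:
    """Affine penalty for a gap of k residues: q + r * k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return q + r * k


q, r = gap_penalties(200)          # e.g. a matrix targeted at 200 PAMs
first_residue = gap_cost(1, q, r)  # penalty for a one-residue gap
```

Under this reading, a target distance of 200 PAMs gives q = 5 and r = 5, so a one-residue gap costs 10 and a three-residue gap costs 20.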
Alignment algorithms typically allow residues in one sequence to be aligned to a gap in the other sequence in exchange for a penalty. Most algorithms employ an affine gap penalty scheme, in which a penalty q is assessed for the existence of a gap and another, usually smaller, penalty r is assessed for extending the gap (Waterman et al., 1976; Gotoh, 1982; Fitch and Smith, 1983). Thus the penalty for the entire gap is q + rk, where k is the gap length.

Gap penalties as log-odds ratios

Although the statistical behavior of local alignment scores is well understood both for ungapped (Karlin and Altschul, 1990) and gapped (Mott, 1992; Altschul et al., 1997; Pearson, 1998) local alignments, current models provide little guidance in the selection of gap penalties. For very distant relationships, experience suggests that gap penalties that are as low as possible, but still produce local alignments between unrelated sequences, are the most effective (Vingron and Waterman, 1994; Pearson, 1995, 1998).

For alignments between more closely related sequences, an information theoretic perspective can provide guidance in selecting gap penalties. Altschul (1991) has shown that in the context of local sequence alignment, the values in residue substitution matrices can be considered 'log-odds ratios'. That is, an entry in the matrix is the logarithm of the ratio of the probability of amino acid i aligning with amino acid j because they diverged from a common ancestor (q_ij) to the probability of the two amino acids i and j being aligned due to chance (p_i p_j). From this perspective, q_i-, the probability of an amino acid i being inserted into or deleted from a protein (the subscript '-' denotes a gap), might be estimated from multiple alignments. However, p_i p_-, the probability of aligning residue i to a gap by chance, is difficult to estimate because the background frequency of gaps (p_-) is unknown and probably changes with different gap penalty values.
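The log-odds interpretation can be made concrete with a short Python sketch. It uses the standard form s = ln(q_ij / (p_i p_j)) / λ, where λ sets the scale of the matrix; for 1/3 bit units, λ = ln 2 / 3. The function name and the example frequencies are hypothetical and chosen only for illustration, and published matrices additionally round each score to an integer.

```python
import math

# Scale parameter for a matrix expressed in 1/3 bit units.
LAMBDA_THIRD_BIT = math.log(2) / 3


def log_odds_score(q_ij: float, p_i: float, p_j: float,
                   lam: float = LAMBDA_THIRD_BIT) -> float:
    """Unrounded log-odds score for aligning residues i and j.

    q_ij: probability that i and j are aligned because of common ancestry.
    p_i, p_j: background frequencies of the two residues.
    lam: scale parameter (default corresponds to 1/3 bit units).
    """
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(q_ij / (p_i * p_j)) / lam


# Hypothetical frequencies: the pair is seen 8 times more often among
# related sequences than expected by chance, i.e. 3 bits of information.
score = log_odds_score(q_ij=0.02, p_i=0.05, p_j=0.05)
```

With these made-up frequencies the score is approximately 9 in 1/3 bit units. The same expression could not be evaluated for a residue aligned to a gap without a value for p_-, which is the difficulty the text describes.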
Consequently, gap penalty parameters have been optimized empirically based on their performance in similarity searches (Pearson, 1995), accuracy of the alignment generated (Fitch and Smith, 1983; Vogt et al., 1995), or maintenance of expected statistical characteristics for unrelated sequences (Vingron and Waterman, 1994). These methods typically have focused on one or a few substitution matrices, e.g. BLOSUM50 or PAM250, but the results are often extrapolated for use with all substitution matrices with the same scale (λ; Altschul, 1991). Thus, the gap penalties used by default in the FASTA

© Oxford University Press 2002