Scoring Residue Conservation

William S.J. Valdar*

Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom

ABSTRACT

The importance of a residue for maintaining the structure and function of a protein can usually be inferred from how conserved it appears in a multiple sequence alignment of that protein and its homologues. A reliable metric for quantifying residue conservation is desirable. Over the last two decades many such scores have been proposed, but none has emerged as a generally accepted standard. This work surveys the range of scores that biologists, biochemists, and, more recently, bioinformatics workers have developed, and reviews the intrinsic problems associated with developing and evaluating such a score. A general formula is proposed that may be used to compare the properties of different particular conservation scores or as a measure of conservation in its own right. Proteins 2002;48:227–241. © 2002 Wiley-Liss, Inc.

Key words: protein sequence analysis; amino acid; variability; evolutionary conservation; multiple sequence alignment

INTRODUCTION

A multiple-sequence alignment is a historical record. The patterns of amino acid variability in its columns tell a story of evolutionary pressure, mutation, recombination, and genetic drift that often spans many millions of years. This story can be read in different ways. According to the neutral model of molecular evolution, once a protein has evolved to a useful level of functionality, most new mutations are either deleterious, in which case they are removed by negative selection, or neutral, in which case they are kept. Therefore, most of the substitutions observed in an alignment are neutral; rather than representing improvements in a protein, they indicate how tolerant the protein is to change at that position.
In an already optimized protein, the rate of substitution will be inversely correlated with the functional constraints acting on that protein. The most functionally important residues of hemoglobin, those that secure the heme group, show a much lower rate of substitution than do others in the protein. The selectionist model of molecular evolution, although agreeing that most mutations are deleterious and removed, argues that accepted mutations usually confer a selective advantage, whereas neutral mutations are rare (Ref. 1 and refs. therein). Although both models have their place, this review takes the perspective of the neutral model only. That model accords better with the idea of conservation among functionally equivalent sequences and is arguably the more evident in alignments from structural biology.

If the degree of functional constraint dictates how conserved a position is, then the converse must also be true, that is, the degree of conservation must indicate the functional importance of that position. Thus, identifying conserved regions of a protein is tremendously useful. In the past, patterns of conservation in multiple alignments were identified by inspection alone. However, the rapid increase of available sequences and published analyses has emphasized the need for objective, automated methods, and in the last decade or so, this has been the subject of considerable research. Much of that work has focused on extracting global patterns and motifs from multiple alignments, often with a view to exploring the relationships between homologues and developing diagnostic tests for functions of newly discovered sequences. For instance, statistically robust profile methods, such as PSI-BLAST (Ref. 2) and those based on hidden Markov models (Ref. 3), have become increasingly popular.
Despite these advances, there have been few recent insights into the derivation of a quantitative conservation measure for a single aligned position, and there certainly is no standard method. Ask a life scientist how similar two sequences are and he will probably quote a percentage identity or an E-value. Ask him how conserved a position is in a family and the reply is most likely to be qualitative. This review discusses what a quantitative measure of conservation should actually measure and, by surveying almost 20 scores, examines some of the problems inherent in developing such a score.

*Correspondence to: William S.J. Valdar, Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. E-mail: William.Valdar@well.ox.ac.uk
Received 12 October 2001; Accepted 22 February 2002
Published online 00 Month 2002 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.10146

Exercises for a Conservation Score

There is no rigorous mathematical test for judging a conservation measure; if there were, one would use the test and not bother with an additional score. Rather than accuracy then, a conservation score may be judged on its verisimilitude: its ability to depict realism and its concordance with biochemical intuition. Figure 1 helps make these abstract notions more concrete. It shows columns of amino acids taken from hypothetical multiple-sequence alignments of functionally equivalent orthologues. For simplicity, we assume each sequence contributes equally