Scoring Residue Conservation
William S.J. Valdar*
Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London,
London, United Kingdom
ABSTRACT The importance of a residue for
maintaining the structure and function of a protein
can usually be inferred from how conserved it appears
in a multiple sequence alignment of that
protein and its homologues. A reliable metric for
quantifying residue conservation is desirable. Over
the last two decades many such scores have been
proposed, but none has emerged as a generally
accepted standard. This work surveys the range of
scores that biologists, biochemists, and, more recently,
bioinformatics workers have developed, and
reviews the intrinsic problems associated with developing
and evaluating such a score. A general formula
is proposed that may be used to compare the
properties of different particular conservation scores
or as a measure of conservation in its own right.
Proteins 2002;48:227–241. © 2002 Wiley-Liss, Inc.
Key words: protein sequence analysis; amino acid;
variability; evolutionary conservation;
multiple sequence alignment
INTRODUCTION
A multiple-sequence alignment is a historical record.
The patterns of amino acid variability in its columns tell a
story of evolutionary pressure, mutation, recombination,
and genetic drift that often spans many millions of years.
This story can be read in different ways. According to the
neutral model of molecular evolution, once a protein has
evolved to a useful level of functionality, most new mutations
are either deleterious, in which case they are removed
by negative selection, or neutral, in which case they
are kept. Therefore, most of the substitutions observed in
an alignment are neutral; rather than representing improvements
in a protein, they indicate how tolerant the
protein is to change at that position. In an already
optimized protein, the rate of substitution will be inversely
correlated with the functional constraints acting on that
protein. For example, the most functionally important residues of
hemoglobin, those that secure the heme group, show a much
lower rate of substitution than do others in the protein.
The selectionist model of molecular evolution, although
agreeing that most mutations are deleterious and removed,
argues that accepted mutations usually confer a
selective advantage, whereas neutral mutations are rare
(Ref. 1 and refs. therein). Although both models have their
place, this review takes the perspective of the neutral
model only. That model accords better with the idea of
conservation among functionally equivalent sequences
and is arguably the more evident in alignments from
structural biology.
If the degree of functional constraint dictates how
conserved a position is, then the converse must also be
true, that is, the degree of conservation must indicate the
functional importance of that position. Thus, identifying
conserved regions of a protein is tremendously useful. In
the past, patterns of conservation in multiple alignments
were identified by inspection alone. However, the rapid
increase of available sequences and published analyses
has emphasized the need for objective, automated methods,
and in the last decade or so, this has been the subject
of considerable research. Much of that work has focused on
extracting global patterns and motifs from multiple alignments,
often with a view to exploring the relationships
between homologues and developing diagnostic tests for
functions of newly discovered sequences. For instance,
statistically robust profile methods, such as PSI-BLAST (Ref. 2)
and those based on hidden Markov models (Ref. 3), have become
increasingly popular.
Despite these advances, there have been few recent
insights into the derivation of a quantitative conservation
measure for a single aligned position, and there certainly
is no standard method. Ask a life scientist how similar two
sequences are and the answer will probably be a percentage
identity or an E-value. Ask how conserved a position is
in a family and the reply is most likely to be qualitative.
This review discusses what a quantitative measure of
conservation should actually measure and, by surveying
almost 20 scores, examines some of the problems inherent
in developing such a score.
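
To make the idea of a quantitative measure for a single aligned position concrete, the following is a minimal sketch of one simple possibility: a score based on the Shannon entropy of the residues in a column. It is an illustration only, not the general formula proposed in this review. The function name `column_conservation`, the normalization by log(min(N, 20)), the equal weighting of sequences, and the absence of gap handling are all assumptions made for the example.

```python
import math
from collections import Counter


def column_conservation(column: str) -> float:
    """Illustrative entropy-based conservation score for one alignment column.

    Returns 1.0 when every sequence carries the same residue and values
    near 0.0 when the column is as diverse as it can be.

    Assumptions (not taken from the source text): the column contains only
    the 20 standard amino acids, has no gaps, and every sequence counts
    equally.
    """
    n = len(column)
    if n == 0:
        raise ValueError("column must contain at least one residue")

    # Relative frequency of each residue type observed in the column.
    counts = Counter(column)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())

    # Largest entropy attainable with n sequences and a 20-letter alphabet.
    max_entropy = math.log(min(n, 20))
    if max_entropy == 0.0:
        # A single sequence cannot show any variability.
        return 1.0

    return 1.0 - entropy / max_entropy


if __name__ == "__main__":
    for col in ("DDDDDDDDDD", "DDDDDEEEEE", "DEFGHIKLMN"):
        print(col, round(column_conservation(col), 3))
```

A score of this kind treats all residue types as equally different from one another, so a column split between aspartate and glutamate scores the same as one split between aspartate and tryptophan. Whether a conservation measure should account for such physicochemical similarity is one of the design questions a more careful score has to answer.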
Exercises for a Conservation Score
There is no rigorous mathematical test for judging a
conservation measure; if there were, one would use the
test and not bother with an additional score. Rather than
accuracy then, a conservation score may be judged on its
verisimilitude: its ability to depict realism and its concordance
with biochemical intuition. Figure 1 helps make
these abstract notions more concrete. It shows columns of
amino acids taken from hypothetical multiple-sequence
alignments of functionally equivalent orthologues. For
simplicity, we assume each sequence contributes equally
*Correspondence to: William S.J. Valdar, Wellcome Trust Centre for
Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3
7BN, UK. E-mail: William.Valdar@well.ox.ac.uk
Received 12 October 2001; Accepted 22 February 2002
Published online 00 Month 2002 in Wiley InterScience
(www.interscience.wiley.com). DOI: 10.1002/prot.10146
PROTEINS: Structure, Function, and Genetics 48:227–241 (2002)