Analysis of a Data Set of Paired Uncomplexed Protein Structures: New Metrics for Side-Chain Flexibility and Model Evaluation

Shanrong Zhao, David S. Goodsell, and Arthur J. Olson*
Department of Molecular Biology, Scripps Research Institute, La Jolla, California

ABSTRACT
We compiled and analyzed a data set of paired protein structures containing proteins for which multiple high-quality uncomplexed atomic structures were available in the Protein Data Bank. Side-chain flexibility was quantified, yielding a set of residue- and environment-specific confidence levels describing the range of motion around χ1 and χ2 angles. As expected, buried residues were inflexible, adopting similar conformations in different crystal structure analyses. Ile, Thr, Asn, Asp, and the large aromatics also showed limited flexibility when exposed on the protein surface, whereas exposed Ser, Lys, Arg, Met, Gln, and Glu residues were very flexible. This information is different from and complementary to the information available from rotamer surveys. The confidence levels are useful for assessing the significance of observed side-chain motion and estimating the extent of side-chain motion in protein structure prediction. We compare the performance of a simple 40° threshold with these quantitative confidence levels in a critical evaluation of side-chain prediction with the program SCWRL. Proteins 2001;43:271–279. © 2001 Wiley-Liss, Inc.

Key words: side-chain flexibility; protein structure prediction

INTRODUCTION
Proteins combine structural rigidity with local flexibility. Most natural proteins adopt a defined folded structure, with secondary-structure segments arranged in a defined geometry.
Layered on top of this relatively rigid core are several levels of flexibility: occasionally, the motion of entire domains alters the entire shape of the protein; often, the motion of connecting loops and terminal extensions modifies the shape of a cleft or extension; and in all proteins, side-chain motion alters the local topography.¹

Side-chain conformation is determined by the intrinsic torsional flexibility of each residue, which is then limited by a combination of external factors: steric contacts with the local peptide backbone, interactions with neighboring parts of the protein, and interactions with surrounding proteins and solvents.

Most analyses of side-chain conformation study the range of motion available to a given residue type, but they do not analyze the flexibility of a given residue within a given protein environment. In a typical study, a database of representative structures is chosen from the Protein Data Bank (PDB), and the range of conformations is tabulated for each type of residue. For χ1 angles, this yields the familiar three-peaked histograms, showing that amino acids generally prefer the three staggered conformations [Fig. 1(A)]. These histograms may be used to generate rotamer libraries for protein structure prediction by picking a representative set of conformations that will cover most of the commonly observed (and, therefore, energetically favored) ranges.

These analyses, however, do not yield information on the flexibility of a given residue within a protein. All residues, whether buried or exposed, are surrounded by other residues, limiting their range of motion. Some positions will allow motion between different rotameric states, but other positions with stronger restraints will not allow such flexibility.
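The three-peaked χ1 histogram described above can be illustrated with a minimal sketch. It assumes χ1 values are already available in degrees and assigns each one to the nearest staggered conformation using 120°-wide bins centered near +60°, 180°, and -60°. The bin boundaries, the function names, and the sample angles are assumptions made for illustration; they are not taken from the surveys discussed in the text.

```python
from collections import Counter


def wrap_angle(deg):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0


def chi1_bin(chi1_deg):
    """Assign a chi1 angle to one of three staggered bins.

    Bins are labeled by their approximate center; the 120-degree-wide
    boundaries are an illustrative assumption.
    """
    a = wrap_angle(chi1_deg)
    if 0.0 <= a < 120.0:
        return "+60"
    if -120.0 <= a < 0.0:
        return "-60"
    return "180"


def chi1_histogram(chi1_values):
    """Count how many chi1 angles fall into each staggered bin."""
    return Counter(chi1_bin(x) for x in chi1_values)


# Made-up chi1 angles (degrees) for one residue type across a structure set.
sample = [62.0, -65.0, 178.0, -175.0, 300.0, 55.0]
histogram = chi1_histogram(sample)
print(dict(histogram))
```

A histogram of this kind pools one observation per residue from many proteins, so it describes what a residue type prefers in general and says nothing about how far a particular residue moves within one protein.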
Because only a single structure of each protein is included in rotamer surveys, location-specific interactions tend to average out, and the results reflect primarily the steric contacts with the main chain of adjacent residues, which are consistent across the entire test set. Rotamer analyses reveal the most energetically favorable conformations when observed in all environments, but a different approach must be taken to determine the flexibility of individual residues within the environment of a given protein.

Instead of surveying a single representative of each protein, we compared several different structures of each protein, looking for differences in side-chain conformation among the different structure solutions. In this way, we could look at each position, such as Arg14 in lysozyme, individually, determining its range of motion and the effect of the local environment on this motion. In this article, we report quantitative values describing the ranges of amino acid flexibility observed in uncomplexed protein structures. This information has important implications for the design and evaluation of protein prediction methods.

Manuscript 13192-MB from the Scripps Research Institute.
Grant sponsor: National Institutes of Health; Grant number: PO1 HL16411.
*Correspondence to: Arthur Olson, Department of Molecular Biology, Scripps Research Institute, 10550 N. Torrey Pines Road, La Jolla, CA 92037. E-mail: olson@scripps.edu
Received 5 September 2000; Accepted 22 January 2001
Published online 00 Month 2001

PROTEINS: Structure, Function, and Genetics 43:271–279 (2001)
© 2001 WILEY-LISS, INC.
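The per-position comparison outlined in the introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the article. It assumes the χ1 angle of each position has already been extracted, in degrees, from every available structure of the same protein. The 40° default borrows the simple threshold mentioned in the abstract; using it here to flag motion between structures is an assumption of this sketch. The position labels and angle values are made up.

```python
from itertools import combinations


def angle_diff(a_deg, b_deg):
    """Smallest absolute difference between two torsion angles, in [0, 180]."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def max_pairwise_diff(angles):
    """Largest pairwise difference among the angles observed for one position."""
    return max(
        (angle_diff(a, b) for a, b in combinations(angles, 2)),
        default=0.0,  # fewer than two structures: no motion can be measured
    )


def flag_flexible(observations, threshold_deg=40.0):
    """Flag positions whose chi angle differs by more than the threshold.

    observations maps a position label to the list of chi angles seen for
    that position, one value per structure of the same protein.
    """
    return {
        pos: max_pairwise_diff(angles) > threshold_deg
        for pos, angles in observations.items()
    }


# Hypothetical positions and angles, for illustration only.
observations = {
    "position_A": [60.0, -60.0],         # switches between staggered states
    "position_B": [-62.0, -58.0, -65.0], # stays within one state
    "position_C": [170.0, -170.0],       # only 20 degrees apart across the wrap
}
flags = flag_flexible(observations)
print(flags)
```

The wrap-around in `angle_diff` matters because torsion angles are periodic: 170° and -170° differ by 20°, not 340°. The sketch does not handle residues with symmetric terminal groups, which would need an additional equivalence rule.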