PROTEINS: Structure, Function, and Bioinformatics

Evolutionary information hidden in a single protein structure

Chien-Hua Shih, Chih-Min Chang, Yeong-Shin Lin, Wei-Cheng Lo, and Jenn-Kang Hwang*

Institute of Bioinformatics, National Chiao Tung University, HsinChu 30050, Taiwan, Republic of China

INTRODUCTION

Functionally and structurally important amino acids can be deduced from their level of conservation in families of homologous proteins. The conserved amino acids are usually involved in enzyme activity, ligand binding or protein–protein interactions, or are buried in the protein cores [1]. A single sequence is unable to convey the wealth of evolutionary information regarding conservation. Determining the level of conservation [2–5] at each amino acid site requires aligning families of homologous sequences and considering factors like amino acid occurrence frequency, stereochemical or physicochemical properties, substitution matrices, phylogenetic trees, and the probabilistic models underlying the evolution.

It is well observed that families of homologous sequences usually share common three-dimensional folds [1]. Indeed, many successful homology modeling methods [6–8] are based on this observation. Therefore, it is expected that protein structures should contain common evolutionary information shared by their homologous sequences. Recent studies show that the protein structure is more than a mere scaffold for positioning residues. It has been shown that B-factors (or atomic mean-square displacements) [9–12], motional correlations in structure [9,11,12], and the locations of catalytic residues [13–15] can be derived directly from the atomic coordinates of protein backbones without any additional assumptions about the protein models.

Here we report that evolutionary information regarding conservation at the residue level can be quantitatively extracted from a single structure.
We show that, in general, the sequence conservation profiles closely resemble the packing density profiles of the structures. Our results indicate that protein structure exerts such strong constraints on the evolvability of each residue that the profile of sequence conservation essentially reflects that of the structure.

RESULTS

Comparison of the weighted contact number and the conservation profiles

The weighted contact number (WCN) is the number of contact atoms at an amino acid site, weighted by the inverse square of the separation between residues (see METHODS). The WCN basically describes the packing density of a protein structure. The larger the WCN of a residue, the more densely packed its environment. The conservation score of a protein is based on the evolutionary rate of each residue, computed using the evolutionary relations among homologous sequences and the

Additional Supporting Information may be found in the online version of this article.
Grant sponsors: National Science Council, The MoE ATU Program, Taiwan, R.O.C.
Chien-Hua Shih and Chih-Min Chang contributed equally to this paper.
*Correspondence to: Dr. Jenn-Kang Hwang, Institute of Bioinformatics, National Chiao Tung University, Hsin Chu 30050, Taiwan, R.O.C. E-mail: jkhwang@faculty.nctu.edu.tw
Received 22 August 2011; Revised 7 February 2012; Accepted 12 February 2012
Published online 20 February 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/prot.24058

ABSTRACT

The knowledge of conserved sequences in proteins is valuable in identifying functionally or structurally important residues. Generating the conservation profile of a sequence requires aligning families of homologous sequences and having knowledge of their evolutionary relationships. Here, we report that the conservation profile at the residue level can be quantitatively derived from a single protein structure with only backbone information.
We found that the reciprocal packing density profiles of protein structures closely resemble their sequence conservation profiles. For a set of 554 non-homologous enzymes, 74% (408/554) of the proteins have a correlation coefficient > 0.5 between these two profiles. Our results indicate that the three-dimensional structure, instead of being a mere scaffold for positioning amino acid residues, exerts such strong evolutionary constraints on the residues of the protein that its profile of sequence conservation essentially reflects that of its structural characteristics.

Proteins 2012; 80:1647–1657. © 2012 Wiley Periodicals, Inc.

Key words: protein structure; sequence conservation; contact number; evolution; B-factors.
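
The weighted contact number described under RESULTS, and the comparison between the reciprocal packing density profile and a conservation profile reported in the abstract, can be illustrated with a minimal Python sketch. It rests on assumptions that the text above does not confirm: each residue is represented by a single backbone point (for example its Cα atom), the WCN of residue i is taken as the sum of 1/r_ij² over all other residues j, and the "correlation coefficient" is read as Pearson's r. The function names, the toy coordinates, and the conservation values are hypothetical and chosen only for illustration.

```python
import math


def weighted_contact_numbers(coords):
    """Return one weighted contact number per residue.

    Assumption: coords holds one (x, y, z) point per residue (e.g. the
    C-alpha atom), and WCN_i = sum over j != i of 1 / r_ij**2.
    """
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i == j:
                continue
            # Squared Euclidean distance between residues i and j.
            dist_sq = sum((p - q) ** 2 for p, q in zip(a, b))
            if dist_sq == 0.0:
                raise ValueError("two residues share the same coordinates")
            total += 1.0 / dist_sq
        wcn.append(total)
    return wcn


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: five backbone points (in angstroms) and a made-up
# per-residue conservation profile of the same length.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (5.0, 3.6, 0.0),
    (2.0, 5.5, 1.0),
    (-1.5, 4.0, 2.0),
]
toy_conservation = [0.9, 0.4, 0.2, 0.5, 1.0]

toy_wcn = weighted_contact_numbers(toy_coords)
# Reciprocal packing density profile: larger values mean looser packing.
toy_reciprocal = [1.0 / w for w in toy_wcn]
r = pearson(toy_reciprocal, toy_conservation)
print("WCN profile:", [round(w, 4) for w in toy_wcn])
print("Pearson r (reciprocal WCN vs. conservation):", round(r, 4))
```

For a real protein, the coordinates would come from the backbone atoms of a structure file and the conservation profile from an alignment-based scoring method; the double loop is quadratic in the number of residues, which is usually acceptable for single chains.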