The Dependence of Amino Acid Pair Correlations on Structural Environment

Adrian P. Cootes,1 Paul M.G. Curmi,2 Ross Cunningham,3 Christine Donnelly,3 and Andrew E. Torda1*

1 Research School of Chemistry, The Australian National University, Canberra, Australia
2 Initiative in Biomolecular Structure, School of Physics, The University of New South Wales, Sydney, Australia
3 Statistical Consulting Unit, The Australian National University, Canberra, Australia

*Correspondence to: Andrew Torda, Research School of Chemistry, The Australian National University, Canberra ACT 0200, Australia. E-mail: Andrew.Torda@anu.edu.au
Received 7 November 1997; Accepted 16 March 1998

ABSTRACT

A statistical analysis was performed to determine to what extent an amino acid determines the identity of its neighbors and to what extent this is determined by the structural environment. Log-linear analysis was used to discriminate chance occurrence from statistically meaningful correlations. The classification of structures was arbitrary, but was also tested for significance. A list of statistically significant interaction types was selected and then ranked according to apparent importance for applications such as protein design. This showed that, in general, nonlocal, through-space interactions were more important than those between residues near in the protein sequence. The highest ranked nonlocal interactions involved residues in β-sheet structures. Of the local interactions, those between residues i and i + 2 were the most important in both α-helices and β-strands. Some surprisingly strong correlations were discovered within β-sheets between residues and sites sequentially near to their bridging partners. The results have a clear bearing on protein engineering studies, but also have implications for the construction of knowledge-based force fields. Proteins 32:175–189, 1998. © 1998 Wiley-Liss, Inc.

Key words: pairwise statistics; secondary structure; nonlocal interactions

INTRODUCTION

It is not yet known how a protein sequence determines its own fold. Consequently, attempts to design sequences to fold to a specified structure have shown promise only recently [1]. Central to solving these problems is the need to determine which intramolecular interactions make the largest contributions to the specificity of a sequence for its native conformation. Theoretical analyses of lattice models [2–5] and experiments [6–9] have suggested that nonlocal interactions generally make a greater contribution to a sequence's structural specificity than do local interactions, although there is some evidence to the contrary [10], at least when considering events at the protein surface [11,12]. However, a more detailed determination of the interactions of greatest significance to real proteins is necessary if the important problems of fold recognition and sequence design are to be solved.

The aim of this paper is to rank the types of amino acid pairwise interaction in order of importance via a statistical analysis of the protein structure database. Viewed anthropomorphically, it can be asked to what extent an amino acid determines its neighbor and to what extent the resulting pair determines its environment class. This approach is physically naive, but it should avoid many assumptions about which contributions to protein composition are the most important.

Statistical analysis of proteins has a long history. Structures have been studied extensively for residue propensities in various physical environments [13–15] and for significant factors in amino acid substitutions in structural homologs [16–18]. However, there has been relatively little statistical analysis for significant factors in pairwise interactions [19–22,40].

We applied log-linear analysis [23–26] of pairwise amino acid statistics to determine both their significance and relative dependence on structural environment. This is a more general approach than is used for knowledge-based force fields and does not rely on statistical mechanics for its derivation [41–42]. Log-linear analysis determines whether variables are dependent on each other by constructing a model that assumes independence of those variables and assessing the fit of that model to the data. The discrepancy of the model from the data is quantified by a measure, termed the "mean deviance," which is calculated from a χ²-distributed log-likelihood ratio statistic.

The validity of a classification of pairwise amino acid interactions with respect to a series of structural variables, such as secondary structure, can be tested using log-linear analysis and quantified by a mean deviance statistic (D_123). This should yield a final classification of apparently independent interaction classes.
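The independence test described above can be illustrated with a minimal sketch for the simplest case, a two-way table of pair counts. It fits the independence model (expected count = row total × column total / grand total) and computes the log-likelihood ratio statistic G² = 2 Σ O ln(O/E). Two things here are assumptions rather than statements from the text: the function names are hypothetical, and "mean deviance" is taken to be the deviance divided by the model's degrees of freedom, which this section does not define explicitly. The counts in the usage note are toy numbers, not data from the study.

```python
import math


def independence_deviance(table):
    """Return (G2, dof) for the independence model on a two-way count table.

    `table` is a list of rows of non-negative counts, e.g. rows indexed by
    the residue type at one site and columns by the residue type at the other.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to G2
                expected = row_totals[i] * col_totals[j] / total
                g2 += 2.0 * observed * math.log(observed / expected)

    # Degrees of freedom of the independence model for an r x c table.
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return g2, dof


def mean_deviance(table):
    """Deviance per degree of freedom (assumed definition, see lead-in)."""
    g2, dof = independence_deviance(table)
    return g2 / dof


if __name__ == "__main__":
    # Toy 2 x 2 counts: two residue classes at each of two paired sites.
    correlated = [[30, 10], [10, 30]]
    uncorrelated = [[10, 20], [20, 40]]
    print(mean_deviance(correlated))
    print(mean_deviance(uncorrelated))
```

A mean deviance near zero indicates that the independence model fits the counts, while a large value indicates that the two variables are correlated. Extending the sketch to three variables (two residue identities plus an environment class) would require fitting a hierarchy of log-linear models, which is not shown here.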