PROTEINS: Structure, Function, and Bioinformatics

Information and discrimination in pairwise contact potentials

Armando D. Solis and S. Rackovsky*

Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, New York, New York 10029

INTRODUCTION

Numerous contacts occur in folded proteins between amino acid residues that are far apart in sequence. The patterns of these so-called long-range interactions have been utilized in protein structure prediction as a criterion to judge the correctness of a given sequence–structure alignment. The scoring scheme is embodied in energy-like empirical potentials [1–4], derived easily from the database of solved high-resolution X-ray structures [5]. The success of these contact potentials, along with their relative ease of application, has made them an important part of statistical potentials for fold recognition and ab initio prediction [6–9].

The effort to design better folding potentials seeks to unlock as much of the structural information encoded in sequence as possible. It is an axiom of protein folding that all the information needed to determine the structure of a protein chain is contained in its amino acid sequence. Therefore, the remarkable performance of contact potentials in discriminating between native and non-native conformations implies that pairwise contacts between residues contain a significant amount of this information. Understanding the nature of long-range pairwise contact information, both in the way it is encoded in sequence and its ability to discriminate among native-like conformations, is of primary interest for optimizing performance.

In this work, we use information-theoretic ideas, developed in previous work [10], to carry out two critical analyses: first, to quantify the information residing in pairwise contacts derived from known structure, and then to understand the behavior of contact potentials in fold recognition as a function of this information.
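
As a rough illustration of what quantifying the information in pairwise contacts can involve, the sketch below estimates the mutual information between a residue-pair type and a binary contact state from a table of observed counts. It uses the standard plug-in estimator, which is not necessarily the estimator used in this work. The function name, the two-letter H/P alphabet, and the counts are all hypothetical and chosen only for illustration.

```python
import math


def mutual_information(counts):
    """Plug-in estimate, in bits, of I(pair type; contact state).

    `counts` maps (pair_type, state) tuples to observed counts.
    This is a hypothetical helper, not the estimator of the paper.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one observation")

    # Marginal probabilities of each pair type and each contact state.
    p_pair = {}
    p_state = {}
    for (pair, state), n in counts.items():
        p_pair[pair] = p_pair.get(pair, 0.0) + n / total
        p_state[state] = p_state.get(state, 0.0) + n / total

    # Sum p(x, y) * log2( p(x, y) / (p(x) p(y)) ) over observed cells.
    mi = 0.0
    for (pair, state), n in counts.items():
        if n == 0:
            continue  # empty cells contribute nothing to the sum
        p_joint = n / total
        mi += p_joint * math.log2(p_joint / (p_pair[pair] * p_state[state]))
    return mi


# Made-up counts for a toy hydrophobic (H) / polar (P) alphabet.
toy_counts = {
    ("HH", "contact"): 60, ("HH", "no_contact"): 40,
    ("HP", "contact"): 40, ("HP", "no_contact"): 60,
    ("PP", "contact"): 30, ("PP", "no_contact"): 70,
}
print(f"{mutual_information(toy_counts):.4f} bits")
```

When the contact state is statistically independent of the pair type, the estimate is zero. It reaches one bit when two equally frequent pair types each determine a binary state completely.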

Two previous studies [11,12] have concluded that the mutual information contained in pairwise contacts, as estimated from observed propensities, is modest at best. The question arises how such a minute amount of information can lead to the proven success of contact potentials in fold recognition. Our work attempts to reconcile these two seemingly divergent observations. We first note that the two studies did not explic-

Grant sponsor: National Library of Medicine of the National Institutes of Health; Grant number: LM006789.

*Correspondence to: S. Rackovsky, Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, Box 1603, One Gustave L. Levy Place, New York, NY 10029. E-mail: shelly@camelot.mssm.edu

Received 27 December 2006; Revised 16 June 2007; Accepted 21 June 2007

Published online 14 November 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21733

ABSTRACT

We examine the information-theoretic characteristics of statistical potentials that describe pairwise long-range contacts between amino acid residues in proteins. In our work, we seek to map out an efficient information-based strategy to detect and optimally utilize the structural information latent in empirical data, to make contact potentials, and other statistically derived folding potentials, more effective tools in protein structure prediction. Foremost, we establish fundamental connections between basic information-theoretic quantities (including the ubiquitous Z-score) and contact "energies" or scores used routinely in protein structure prediction, and demonstrate that the informatic quantity that mediates fold discrimination is the total divergence. We find that pairwise contacts between residues bear a moderate amount of fold information, and if optimized, can assist in the discrimination of native conformations from large ensembles of native-like decoys.
Using an extensive battery of threading tests, we demonstrate that parameters that affect the information content of contact potentials (e.g., choice of atoms to define residue location and the cut-off distance between pairs) have a significant influence on their performance in fold recognition. We conclude that potentials that have been optimized for mutual information and that have a high number of score events per sequence–structure alignment are superior in identifying the correct fold. We derive the quantity "information product" that embodies these two critical factors. We demonstrate that the information product, which does not require explicit threading to compute, is as effective as the Z-score, which requires expensive decoy threading to evaluate. This new objective function may be able to speed up the multidimensional parameter search for better statistical potentials. Lastly, by demonstrating the functional equivalence of quasi-chemically approximated "energies" to fundamental informatic quantities, we make statistical potentials less dependent on theoretically tenuous biophysical formalisms and more amenable to direct bioinformatic optimization.

Proteins 2008; 71:1071–1087. © 2007 Wiley-Liss, Inc.

Key words: pairwise contact potentials; empirical potentials; statistical potentials; information theory; protein fold recognition; threading; divergence.