PROTEINS: Structure, Function, and Genetics 18:254-261 (1994)

An Improved Pair Potential to Recognize Native Protein Folds

Aron Bauer and Anton Beyer
Research Institute of Molecular Pathology, A-1030 Vienna, Austria

ABSTRACT
We present a novel method to improve a simple pair potential of mean force, derived from experimentally determined protein structures, in such a way that it recognizes native protein folds with high reliability. This improvement is based on the use of mutation data matrices to overcome difficulties arising from the poor statistics of small sample sizes. A set of 167 protein chains taken from the Brookhaven Protein Structure Data Base, selected from high-resolution structures and avoiding homologous proteins, is used for generation of the potential set. The potential describes interresidue pair energies depending on distance and sequential separation, and is calculated using the Boltzmann equation. Its performance is evaluated by jackknife tests that try to identify the native fold for a given sequence among a large number of possible threadings on all structures in the set without allowing for gaps. Up to 94% of the protein chains are correctly assigned to their native folds, so that all proper single-chain domains are recognized. © 1994 Wiley-Liss, Inc.

Key words: Boltzmann equation, pair potential, mutation data matrix, jackknife test, protein fold recognition, threading

INTRODUCTION
Statistically derived potentials provide a novel way to assess the stability of a given fold for a given amino acid sequence and may therefore be used for protein fold recognition. Quite a number of approaches in this direction have been undertaken (for reviews see 1 and 2), but it is still unclear what type of potential is best suited for fold recognition, inverse folding, and general structure prediction.
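As a rough illustration of how a statistical pair potential can be obtained from observed distances, the sketch below applies the inverse Boltzmann relation, E = -kT ln(f_pair / f_ref), to binned counts. It is only a minimal sketch: the function name `pair_energies`, the choice of reference distribution (counts pooled over all pair types), the pseudocount, and the use of kT = 1 as the energy unit are assumptions made for illustration and are not taken from the paper.

```python
import math

KT = 1.0  # assumed energy unit: energies are reported in multiples of kT


def pair_energies(pair_counts, reference_counts, pseudocount=1e-6):
    """Convert distance-bin counts into energies via the inverse Boltzmann relation.

    pair_counts:      counts per distance bin for one residue pair type
    reference_counts: counts per distance bin pooled over all pair types
                      (an assumed reference state)
    Returns one energy per bin: E = -kT * ln(f_pair / f_ref).
    """
    pair_total = sum(pair_counts)
    ref_total = sum(reference_counts)
    energies = []
    for n_pair, n_ref in zip(pair_counts, reference_counts):
        # Relative frequencies; the small pseudocount avoids log(0) in empty bins.
        f_pair = n_pair / pair_total + pseudocount
        f_ref = n_ref / ref_total + pseudocount
        energies.append(-KT * math.log(f_pair / f_ref))
    return energies


# Hypothetical counts for three distance bins.
example = pair_energies([10, 30, 60], [100, 100, 100])
```

Bins in which the pair occurs more often than in the reference receive negative (favourable) energies, and underrepresented bins receive positive ones. When a pair type has only a handful of observations, these frequencies become noisy, which is the small-sample problem the abstract refers to.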
The limited amount of structural data seems to be the biggest problem for any reliable statistics on protein structures, especially when the dataset is further reduced by applying constraints on resolution and homology. Detailed statistics on protein structures will always face these difficulties unless vast amounts of structural data are available. To overcome the problem of missing data, smoothing procedures have to be devised.

One way of improving the statistics is to reduce the number of variables used, either by grouping the 20 amino acids into fewer categories, or by reducing the levels of sequential separation along the chain, or by considering only a few distance intervals.4 Often a pure contact potential, which distinguishes between just two types of residue-residue interactions, is used. However, when detailed statistics on proteins are compiled, many infrequent combinations (e.g., those involving rare amino acids) are encountered.

Here we present a method that aims to gather additional information on protein structures using putative homologues of a protein of known structure. Homologous proteins contain a wealth of further information; however, including homologues biases the derived potential or statistics toward a few groups of overrepresented protein families. This could be overcome by applying suitable weights, but doing so might turn out to be difficult. Sufficient information on homology might also be found in mutation matrices, which, basically, give the probabilities for one amino acid type being replaced by another in related sequences. We consider single measurements of distances as the events that constitute our statistics and, finally, the potential. A mutation matrix may be regarded as giving the probabilities for observing amino acid pairs different from the current pair associated with the current measurement.
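One way to read this is that each distance measurement for an observed pair also contributes, with reduced weight, to pairs that differ from it by a single residue. The sketch below illustrates that reading on a toy three-letter alphabet. The substitution probabilities, the function name `add_observation`, and the weighting scheme (full weight for the observed pair, substitution probability for each single-residue replacement) are all hypothetical; the weighting actually used by the authors is not specified here and may differ.

```python
from collections import defaultdict

# Toy three-letter alphabet and substitution probabilities (hypothetical values,
# not taken from any published mutation data matrix).
ALPHABET = ["A", "V", "I"]
SUBST = {
    "A": {"A": 0.80, "V": 0.15, "I": 0.05},
    "V": {"A": 0.10, "V": 0.60, "I": 0.30},
    "I": {"A": 0.05, "V": 0.35, "I": 0.60},
}


def add_observation(counts, a, b, dist_bin, subst=SUBST, alphabet=ALPHABET):
    """Record one distance measurement for pair (a, b) and share it with
    pairs that differ from (a, b) in exactly one residue (assumed weighting)."""
    counts[(a, b, dist_bin)] += 1.0  # the observed pair itself, full weight
    for x in alphabet:
        if x != a:
            # Replace the first residue: pair (x, b) gets weight P(a -> x).
            counts[(x, b, dist_bin)] += subst[a][x]
        if x != b:
            # Replace the second residue: pair (a, x) gets weight P(b -> x).
            counts[(a, x, dist_bin)] += subst[b][x]
    return counts


counts = add_observation(defaultdict(float), "A", "V", dist_bin=3)
```

With the toy alphabet, a single A-V measurement touches five entries: the observed pair, two pairs with the first residue replaced, and two with the second replaced. With the full 20-letter alphabet the same loop would touch 1 + 19 + 19 entries.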
Thus, in principle, the information gained from the observation of one out of 400 possible amino acid pairs can also be utilized for the remaining 399 pairs. For example, any observation for a pair Ala-Val might also be useful for calculating the potentials for Ala-Ile pairs, Ser-Val pairs, etc. In our study, we apply matrix substitutions of first order only, i.e., the information for a pair a-b is also distributed among the 19 pairs a-x and the 19 pairs x-b. Single amino acid substitutions are found frequently in related protein sequences and hardly alter the conformation of a molecule.

Received August 6, 1993; revision accepted October 26, 1993.
Address reprint requests to Dr. Anton Beyer, Research Institute of Molecular Pathology, Dr. Bohr Gasse 7, A-1030 Vienna, Austria.

We assume that statistically derived potentials have physical significance, representing a combination of several types of interactions, including hydrophobicity, electro-