Protein Fold Recognition Score Functions: Unusual Construction Strategies

Daniel J. Ayers,¹ Thomas Huber,² and Andrew E. Torda¹*
¹ Research School of Chemistry, Australian National University, Canberra, Australia
² ANU Supercomputer Facility, Australian National University, Canberra, Australia

ABSTRACT We describe two ways of optimizing score functions for protein sequence to structure threading. The first method adjusts parameters to improve sequence to structure alignment. The second adjusts parameters so as to improve a score function's ability to rank alignments calculated with the first score function. Unlike those functions known as knowledge-based force fields, the resulting parameter sets do not rely on Boltzmann statistics, have no claim to representing free energies and are purely constructions for recognizing protein folds. The methods give a small improvement, but suggest that functions can be profitably optimized for very specific aspects of protein fold recognition. Proteins 1999;36:454–461. © 1999 Wiley-Liss, Inc.

Key words: force field; optimization; protein structure; structure prediction; knowledge-based protein structure prediction; fold recognition; threading

INTRODUCTION

For a molecular mechanics calculation, it is common to try to use a model that directly reflects nature. Such a model should work under a range of conditions and be transferable from system to system. If, however, one is only interested in a single specific property, it may not be necessary to chase the perfect force field and parameters. It may be easier to find some score function specialized for this property. This work is based on this idea and the goal of finding functions for protein fold recognition/threading even if the functions do not reflect real energies or other molecular mechanics properties. This goal leads to distinct differences between special purpose scoring functions and more general force fields.
In a simple scoring function, it is not necessary to closely mimic the laws of physics. Instead, interactions may be represented by some set of functions which are easy to parameterize. For example, smoothed contact functions are a gross simplification of nature, but may be easy to optimize for protein sequence to structure threading.¹

Next, there will be differences in how one chooses the parameters that characterize the interactions. In an atomistic force field, properties such as bond lengths, angles, and so on could just be taken from experiment or some higher level calculation.²⁻⁵ For a score function for protein sequence-structure threading, there are several approaches one could take. One could assume that protein structures follow a Boltzmann distribution and calculate a potential of mean force from known protein structures.⁶⁻⁸ Another school of thought is that one should simply try to discriminate good sequence-structure pairs from unlikely ones.⁹ This is usually done by optimizing a score function's ability to distinguish ideal sequence-structure pairs from some set of incorrect (misfolded) sequence-structure pairs.¹,¹⁰⁻¹⁴

This work continues along these lines where one first defines a measure of score function quality and then adjusts parameters so as to maximize the quality function using some set of training proteins. This idea is extended by splitting the task of protein sequence to structure threading into two sub-problems with a special parameter set for each. It is shown how one might optimize one score function for protein sequence to structure alignment and a totally separate function for ranking the calculated alignments. The rationale for this is that different score functions work best in different problem domains¹⁵ and it is clear that sequence to structure alignment is a different problem from ranking of aligned structures.
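The general idea of defining a quality measure and adjusting parameters to maximize it over a training set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the linear `score`, the z-score style `quality` measure, the random hill-climbing `optimize` routine, and the toy feature vectors in `TOY_TRAINING_SET` are all hypothetical.

```python
import random
import statistics

def score(params, features):
    """Hypothetical linear score; lower is taken to mean a better fit."""
    return sum(p * f for p, f in zip(params, features))

def quality(params, training_set):
    """Mean z-score gap between each native pair and its misfolded decoys."""
    gaps = []
    for native, decoys in training_set:
        decoy_scores = [score(params, d) for d in decoys]
        mean = statistics.mean(decoy_scores)
        # Fall back to 1.0 when all decoys score identically.
        spread = statistics.pstdev(decoy_scores) or 1.0
        gaps.append((mean - score(params, native)) / spread)
    return statistics.mean(gaps)

def optimize(params, training_set, steps=2000, step_size=0.1, seed=0):
    """Random hill climbing: keep a perturbed parameter set only if quality rises."""
    rng = random.Random(seed)
    best = list(params)
    best_q = quality(best, training_set)
    for _ in range(steps):
        trial = [p + rng.gauss(0.0, step_size) for p in best]
        trial_q = quality(trial, training_set)
        if trial_q > best_q:
            best, best_q = trial, trial_q
    return best, best_q

# Toy data (invented): each entry is (native features, list of decoy features).
TOY_TRAINING_SET = [
    ([1.0, 0.0], [[0.0, 1.0], [0.5, 0.5], [0.2, 0.9]]),
    ([0.9, 0.1], [[0.1, 0.8], [0.4, 0.6], [0.3, 0.7]]),
]

tuned_params, tuned_quality = optimize([0.0, 0.0], TOY_TRAINING_SET)
```

A derivative-free search is used here only because it makes no smoothness assumption about the quality measure; any other optimizer could be substituted.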
In the first step (alignment), there is a sequence of interest which has to be aligned to each member of a library of 10² to 10³ candidate or decoy structures. In the second step (ranking), one wants to rank the 10² to 10³ generated structures according to which is closest to the (unknown) native structure.

During alignment calculation, the set of decoys is very large since one should allow for a gap in either the sequence or template of any length and at any position. Formally, this means that the searching problem is NP-complete.¹⁶ Physically, this means that the set of decoys is not limited to compact, protein-like structures since it implicitly includes the astronomical set of wrong alignments and, conceptually, structures with additional or missing residues. In the second phase, one has a small set of just 10² to 10³ alignments which have to be ranked. The set of alternative/decoy structures is a set of (hopefully) optimal alignments on non-optimal templates.

Score functions for both phases were parameterized by building a penalty function that operated on score function parameters. Although the score functions were continuous, the penalty functions for optimizing them were not.

*Correspondence to: Andrew E. Torda, Research School of Chemistry, Australian National University, Canberra ACT 0200, Australia. E-mail: Andrew.Torda@anu.edu.au
Received 19 January 1999; Accepted 3 May 1999
PROTEINS: Structure, Function, and Genetics 36:454–461 (1999) © 1999 WILEY-LISS, INC.
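To give a rough sense of how quickly the implicit set of alignments grows once gaps of any length are allowed at any position, the sketch below counts gapped global alignments of a sequence against a template. It rests on an illustrative assumption that is not from the paper: each alignment is treated as a lattice path made of match, sequence-gap, and template-gap columns, so different orderings of adjacent gaps are counted separately. The function name `count_gapped_alignments` is hypothetical.

```python
def count_gapped_alignments(seq_len, template_len):
    """Count global alignments built from match, sequence-gap and template-gap columns."""
    # counts[j] is the number of alignments of the residues processed so far
    # against the first j template sites; with no residues there is exactly one.
    counts = [1] * (template_len + 1)
    for _ in range(seq_len):
        diagonal = counts[0]  # value for (previous residue, previous site)
        for j in range(1, template_len + 1):
            above = counts[j]  # value for (previous residue, same site)
            # Extend by a gap in the template, a gap in the sequence, or a match.
            counts[j] = above + counts[j - 1] + diagonal
            diagonal = above
    return counts[template_len]

# Number of decimal digits in the count for a 100-residue sequence on a 100-site template.
digits_for_100 = len(str(count_gapped_alignments(100, 100)))
```

Under this counting, even a three-residue sequence on a three-site template has 63 alignments, and the count grows exponentially with length.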