PHYSICAL REVIEW E, VOLUME 48, NUMBER 3, SEPTEMBER 1993

Sequence-structure relationships in proteins and copolymers

Kaizhi Yue and Ken A. Dill
Department of Pharmaceutical Chemistry, Box 1204, University of California at San Francisco, San Francisco, California 94143
(Received 14 January 1993)

We model proteins as copolymer chains of H (hydrophobic) and P (polar) monomers configured as self-avoiding flights on three-dimensional simple-cubic lattices. The HH interaction is favorable. The folding problem is to find the "native" conformation(s) (lowest free energy) for an HP sequence. Using geometric proofs for self-avoiding lattice chains, we develop equations relating a monomer sequence to its native structures. These constraint relations can be used for two purposes: (1) to compute a tight lower bound on the free energy of the native state for HP sequences of any length, which is useful for testing conformational search strategies, and (2) to develop a search strategy. In its present implementation, the search strategy finds native states for HP lattice chains up to 36 monomers in length, which is a speedup of 5-15 orders of magnitude over existing brute-force exhaustive-search methods.

PACS number(s): 87.10.+e

I. INTRODUCTION

The protein-folding problem is the question of how a linear polymer chain, composed of a specific sequence of amino acids, encodes the unique three-dimensional structure to which it folds. The relationship between the amino-acid sequence, on the one hand, and the "native" conformation (i.e., of lowest free energy) of a protein chain, on the other hand, has been explored using simple lattice models [1-4]. In the HP model [3, 5-7], the 20 different amino acids are assumed to fall into two classes: hydrophobic (H) or polar (P). Chains are configured as self-avoiding walks on two-dimensional square lattices or three-dimensional cubic lattices.
HH contacts are favorable, and are assumed to be the dominant interaction [8], so under strong folding conditions the native conformations are those that have the greatest number of HH contacts. For chains that are sufficiently short, the globally optimal states have been found by brute-force exhaustive computer enumeration [5-7].

The HP model has the following proteinlike features. When the HH sticking energy is small, the chains have an ensemble of open conformations (the "denatured state"), but when sticking is strong, chains with certain sequences of H and P monomers collapse, through a relatively sharp transition [5, 9], to a small ensemble of compact states (often only one or two) [5, 6], with cores of H monomers, comprised of about the same distribution of helices and sheets as in the known proteins [10, 11]. HP lattice proteins also resemble real proteins in some mutational [3, 7, 12] and kinetic [13, 14] properties.

A virtue of this simple model is that its partition function can be enumerated exactly, but a major problem is that the global optima cannot be found for longer chains on three-dimensional lattices because the computer time for brute-force enumeration is prohibitive and increases exponentially with chain length [6]. To search for native conformations of longer chains in the HP lattice model, O'Toole and Panagiotopoulos have developed efficient Monte Carlo search procedures [9], and Unger and Moult have developed a genetic algorithm [15].

One approach to studying three-dimensional chains through exact enumeration involves the use of a somewhat different model. The "perturbed homopolymer" model [1, 2] assumes all monomers (H and P) are sufficiently strongly self-attractive that native states are guaranteed to be among those that are maximally compact. Energetic differences between H and P monomers are taken to be a small perturbation relative to the strong background attraction of all monomers for each other.
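As a concrete illustration of the HP model and of the brute-force baseline described above, the following Python sketch counts HH contacts for a conformation on the simple-cubic lattice and exhaustively enumerates self-avoiding walks to find the maximum contact count for a short sequence. The function names `count_hh_contacts` and `max_hh_contacts` are hypothetical and chosen only for illustration. The sketch assumes that a contact is a pair of H monomers on adjacent lattice sites that are not bonded along the chain. It is not the search strategy developed in this paper, only the kind of exhaustive enumeration whose cost grows exponentially with chain length.

```python
# Six unit steps of the simple-cubic lattice.
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def count_hh_contacts(sequence, coords):
    """Count HH contacts: pairs of H monomers on adjacent lattice sites
    that are not consecutive along the chain (assumed definition)."""
    site_to_index = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y, z) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy, dz in STEPS:
            j = site_to_index.get((x + dx, y + dy, z + dz))
            # j > i + 1 counts each pair once and skips bonded neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return contacts


def max_hh_contacts(sequence):
    """Exhaustively enumerate self-avoiding walks and return the largest
    number of HH contacts found. Cost is exponential in chain length."""
    n = len(sequence)
    best = 0
    coords = [(0, 0, 0)]
    occupied = {(0, 0, 0)}

    def extend():
        nonlocal best
        if len(coords) == n:
            best = max(best, count_hh_contacts(sequence, coords))
            return
        x, y, z = coords[-1]
        for dx, dy, dz in STEPS:
            site = (x + dx, y + dy, z + dz)
            if site in occupied:
                continue  # self-avoidance: each site holds one monomer
            coords.append(site)
            occupied.add(site)
            extend()
            coords.pop()
            occupied.remove(site)

    extend()
    return best


print(max_hh_contacts("HPPH"))  # 1: the two ends can meet in a U-shaped turn
print(max_hh_contacts("HPHP"))  # 0: both H monomers sit on the same sublattice
```

Each added monomer multiplies the number of candidate walks by up to five, so this approach is only practical for very short chains.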
The 27-monomer-chain cube has been studied in the perturbed homopolymer model [1, 2]. Exhaustive enumeration is computationally prohibitive for longer chains in three dimensions in either the HP or perturbed homopolymer models.

Here we explore a different strategy to find native states for longer chains on three-dimensional lattices in the HP model. We analyze the geometric packing constraints for models of chains, using the method of discrete geometry [16, 17]. We then develop an equation relating an amino-acid sequence to certain features of its compact native conformations. The search for native states is formulated in terms of a search for conformations that have a core of H monomers of minimal surface area. This constrained optimization is treated at three different levels of increasingly detailed accounting for the chain connectivity and sequence. At each level, we can predict the general characteristics of the H cores and compute upper bounds on the maximum number of HH contacts achievable by a given sequence. Such bounds can be used to assess how successful sampling strategies such as Monte Carlo, simulated annealing, and genetic algorithms are, i.e., how closely they come to finding globally optimal conformations. The constraint relations are also useful in guiding a search to find native conformations. A search program is described.

II. THE MODEL AND DEFINITIONS

First, we define some terms. We consider copolymer chains, each consisting of a specific sequence of H and P
1063-651X/93/48(3)/2267(12)/$06.00 48 2267 1993 The American Physical Society