Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory†

William C. Swope* and Jed W. Pitera
IBM Almaden Research Center, 650 Harry Road, San Jose, California 95120

Frank Suits
IBM Watson Research Center, Route 134, Yorktown Heights, New York 10598

Received: November 10, 2003; In Final Form: February 21, 2004

A rigorous formalism for the extraction of state-to-state transition functions from a Boltzmann-weighted ensemble of microcanonical molecular dynamics simulations has been developed as a way to study the kinetics of protein folding in the context of a Markov chain. Analysis of these transition functions for signatures of Markovian behavior is described. The method has been applied to an example problem that is based on an underlying Markov process. The example problem shows that when an instance of the process is analyzed under the assumption that the underlying states have been aggregated into macrostates, a procedure known as lumping, the resulting chain appears to have been produced by a non-Markovian process when viewed at high temporal resolution. However, when viewed on longer time scales, and for appropriately lumped macrostates, Markovian behavior can be recovered. The potential for extracting the long time scale behavior of the folding process from a large number of short, independent molecular dynamics simulations is also explored.

1. Introduction

An understanding of the mechanisms by which proteins fold would have wide utility in many areas, ranging from the development of effective treatments for protein folding related diseases to exploitation of the underlying principles of folding to facilitate industrial nanotechnology. The study of protein folding has three aspects: thermodynamics, kinetics, and structure prediction. In this work we introduce an approach to characterizing some aspects of protein folding kinetics and apply it to a simple example problem.
In a companion paper, 1 we apply the approach to the folding of a small peptide, the C-terminal β-hairpin motif from protein G.

Protein folding has been extensively studied experimentally 2-6 and by computer simulation. 7-12 Computer simulations can provide information about the process that is highly complementary to that obtained from experiment. 8,13-17 Furthermore, the computer power available for biomolecular simulations in general, and protein folding in particular, is increasing through the production of improved software to exploit parallelism, 18 specialized hardware, 19 larger and faster computer systems, and grid and distributed computing approaches. 20-23 Indeed, the IBM BlueGene project, 24-27 to build a massively parallel computer to investigate biomolecular processes such as protein folding, is expected to systematically study a variety of peptide and small protein systems and will produce very large volumes of simulation data. One significant advantage of this greater computer power is that the field is moving from studies that report on single events observed during single trajectories of limited duration, 7 to studies where extensive thermodynamic sampling has been performed 11-13,28-30 and ensembles of trajectories are produced and analyzed. 8,9 Obtaining large numbers of independent trajectories is not only a very effective way to use parallel computing technologies but is required for statistically meaningful and reproducible results. 31 Because of this move to more comprehensive simulations, new and automatable analysis procedures that can be applied consistently to data from simulations of a variety of protein systems need to be developed and validated.

Protein folding is generally studied in the liquid phase, where the protein or peptide is in contact with a solvent.
Besides providing part of the driving force for the folding process, through hydrophobic and hydrophilic hydration, the solvent also provides friction and a heat bath for the process. In fact, because of the random forces exerted by the solvent, one would expect that if several identical peptides could be prepared in the same conformation and solvated, they would very likely adopt different folding trajectories, perhaps following completely different paths and taking different amounts of time to reach native conformations. It is because of this stochastic nature of folding that one should be careful not to draw strong conclusions about the process if they are deduced from single MD trajectories.

But given that hundreds of protein simulation trajectories can be produced, what is the best way to use them to understand the process of folding? One possible approach, explored in this work, is to analyze the trajectories to produce a probability for the evolution of the protein from one conformational state to another. The formalism associated with Markov processes and models is, therefore, a natural approach for this analysis.

† Part of the special issue “Hans C. Andersen Festschrift”.
* To whom correspondence should be addressed.
J. Phys. Chem. B 2004, 108, 6571-6581. 10.1021/jp037421y CCC: $27.50 © 2004 American Chemical Society. Published on Web 04/22/2004

Markov models of stochastic processes deal with the temporal evolution of the state of a system. They are appropriate when the memory of the system is short, that is, when the evolution of the system into the near future depends only on its properties at the current time and not on any of its prior history. Markov models can be of several types depending on whether one discretizes the time domain, the state space, or both. With a discrete time Markov chain, both the time and space domain