Algorithm and Data Structures for Efficient Energy Maintenance during Monte Carlo Simulation of Proteins

Itay Lotan*  Fabian Schwarzer*  Dan Halperin†  Jean-Claude Latombe*

March 14, 2003

Abstract

Monte Carlo simulation (MCS) is a common methodology to compute pathways and thermodynamic properties of proteins. A simulation run is a series of random steps in conformation space, each perturbing some degrees of freedom of the molecule. A step is accepted with a probability that depends on the change in value of an energy function. Typical energy functions sum many terms. The most costly ones to compute are contributed by atom pairs closer than some cutoff distance. This paper introduces a new method that speeds up MCS by exploiting the facts that proteins are long kinematic chains and that few degrees of freedom are changed at each step. A novel data structure, called the ChainTree, captures both the kinematics and the shape of a protein at successive levels of detail. It is used to efficiently detect self-collision (steric clash between atoms) and/or find all atom pairs contributing to the energy. It also makes it possible to identify partial energy sums left unchanged by a perturbation, thus allowing the energy value to be incrementally updated. Computational tests on four proteins of sizes ranging from 68 to 755 amino acids show that MCS with the ChainTree method is significantly faster (as much as 10 times faster for the largest protein) than with the widely used grid method, though the latter is asymptotically optimal in the worst case. They also indicate that speed-up increases with larger proteins.

1 Introduction

1.1 Monte Carlo simulation (MCS)

The study of the conformations adopted by proteins is an important topic in structural biology. MCS [6] is one common methodology for this study.
In this context, it has been used for two purposes: (1) estimating thermodynamic quantities over a protein's conformation space [29, 39, 63] and, in some cases, even kinetic properties [50, 51]; and (2) searching for low-energy conformations of a protein, including its native structure [2, 3, 64]. The approach was originally proposed in [43], but many variants and improvements have later been suggested [28].

MCS is a series of randomly generated trial steps in the conformation space of the studied molecule. Each such step consists of perturbing some degrees of freedom (DOFs) of the molecule [40, 50, 51, 63, 64], in general torsion (dihedral) angles around bonds (see Section 1.2). Classically, a trial step is accepted (i.e., the simulation actually moves to the new conformation) with probability $\min\{1, e^{-\Delta E / (k_b T)}\}$ (the so-called Metropolis criterion [43]), where $E$ is an energy function defined over the conformation space, $\Delta E$ is the difference in energy between the new and previous conformations, $k_b$ is the Boltzmann constant, and $T$ is the temperature of the system. So, a downhill step to a lower-energy conformation is always accepted, while an uphill step is accepted with a probability that goes to zero as the energy barrier grows large. It has been shown that a long MCS with the Metropolis criterion and an appropriate step generator produces a distribution of accepted conformations that converges to the Boltzmann distribution.

Molecular Dynamics simulation (MDS) is another common approach for studying protein conformations. In MDS, the forces on the atoms are computed at each step and used to calculate the atom positions at the next step. To be physically accurate, MDS proceeds by small time steps (usually on the order of a few femtoseconds), resulting in slow progress through conformation space.
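As an illustration of the Metropolis criterion described earlier in this section, the following minimal Python sketch accepts a trial step with probability min{1, exp(−ΔE/(k_b T))} and uses that rule inside a generic simulation loop. It is not the paper's implementation: the function names (`metropolis_accept`, `run_mcs`, `toy_energy`, `toy_perturb`), the toy cosine energy, and the step size of 0.1 radians are all assumptions made for this example, and the energy is recomputed from scratch at every step rather than updated incrementally.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng=random):
    """Metropolis criterion: accept with probability min(1, exp(-delta_e / kT)).

    delta_e and kT (= k_b * T) must be in the same energy units, with kT > 0.
    """
    if delta_e <= 0.0:
        return True  # downhill or flat steps are always accepted
    # Uphill step: accept with probability exp(-delta_e / kT).
    return rng.random() < math.exp(-delta_e / kT)


def run_mcs(energy, perturb, x0, kT, n_steps, seed=0):
    """Generic MCS loop (illustrative): propose a step, then accept or reject it."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    accepted = 0
    for _ in range(n_steps):
        x_new = perturb(x, rng)      # trial conformation
        e_new = energy(x_new)        # full recomputation (no incremental update)
        if metropolis_accept(e_new - e, kT, rng):
            x, e = x_new, e_new      # move to the new conformation
            accepted += 1
    return x, e, accepted


# Hypothetical toy model: a "conformation" is a list of torsion angles.
def toy_energy(angles):
    """Made-up energy with a minimum at angle 0 for every DOF."""
    return sum(1.0 - math.cos(a) for a in angles)


def toy_perturb(angles, rng):
    """Perturb a single randomly chosen DOF by a small random amount."""
    new_angles = list(angles)
    i = rng.randrange(len(new_angles))
    new_angles[i] += rng.uniform(-0.1, 0.1)
    return new_angles


if __name__ == "__main__":
    x, e, acc = run_mcs(toy_energy, toy_perturb, [1.0, -2.0, 0.5],
                        kT=0.5, n_steps=1000, seed=1)
    print("final energy:", e, "accepted steps:", acc)
```

In this sketch each trial step changes only one DOF, yet the whole energy is re-evaluated; that full recomputation is the cost the method introduced in this paper aims to reduce.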
* Department of Computer Science, Stanford University, Stanford, CA 94305, USA. E-mail: [itayl, schwarzf, latombe]@cs.stanford.edu.
† School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel. E-mail: danha@post.tau.ac.il

Most MDS techniques update the Cartesian coordinates of the atoms at each step, but recently there has been growing interest in directly updating torsion angles [24, 55], as this allows