Greedy Algorithms for Fast Discovery of
Macrostates from Molecular Dynamics
Simulations
Haoyun Feng
Department of Computer Science and Engineering, Notre Dame, US
Email: hfeng@nd.edu
Geoffrey Siwo
Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, US
Email: siwomolbio@gmail.com
Jesus A. Izaguirre, Douglas Thain, and Badi Abdul-Wahid
Department of Computer Science and Engineering, Notre Dame, US
Email: {izaguirr, dthain, cabdulwa}@nd.edu
Abstract—With the development of distributed computing
systems, it is possible to significantly accelerate long-term
molecular dynamics simulations by using ensemble
algorithms, such as Markov State Models (MSM) and
Weighted Ensemble (WE). Decomposing the conformational
space of a molecule into macrostates is an important step in
both methods. To ensure the efficiency and accuracy of
ensemble methods, the macrostates must be
defined according to certain kinetic properties. Monte Carlo
simulated annealing (MCSA) has been widely applied to
define macrostates that optimize the metastability of the
dynamical system. This article proposes two greedy
algorithms, G1 and G2, based on different definitions of the
local search space, to improve on the efficiency and scalability of
MCSA on distributed computing systems. Numerical
experiments are conducted on two biological systems,
alanine dipeptide and the WW domain. The
experiments demonstrate that G1 is the most efficient of the
three algorithms both on a single-core machine and on a distributed
computing system. The sequential version of G2 is the slowest, but it
gains the largest speedup on distributed computing systems.
Index Terms—molecular dynamics, metastable states,
unsupervised clustering, greedy algorithm.
I. INTRODUCTION
Molecular dynamics simulation [1] is a numerical tool
for simulating the movements of molecules. It generates a
sequence of coordinates, called a trajectory, that represents
the Brownian motion of the molecules. Biological studies
such as drug design usually require long-term (millisecond-scale)
simulations that may cost several years of CPU
time. However, with the development of distributed
computing systems, it is possible to generate a large
number of short trajectories in parallel. Recovering
long-term dynamics from a large number of short trajectories
has drawn the attention of many researchers, and a
number of techniques have been developed for this
purpose. The two most popular techniques are Markov
State Models (MSM) [2]-[6] and Weighted Ensembles
(WE) [2], [7]-[11].
Manuscript received October 12, 2014; revised January 5, 2015.
Partitioning the continuous conformational space into
macrostates plays a key role in both techniques. MSM
calculates the implied time scales of the dynamics from the
transition matrix, in which each entry represents the
transition probability between two macrostates. The
quality of the macrostate definitions affects the accuracy of
the MSM. WE, on the other hand, is a distributed algorithm
that provides efficient sampling of the conformational space
and reliable estimates of reaction rates between free
energy wells. Macrostates are defined so that WE
samples the conformational space evenly by maintaining
a constant number of samples in every macrostate.
The efficiency of WE therefore depends on the underlying partition.
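The idea of keeping a constant number of samples in every macrostate can be illustrated with a minimal sketch. It is not the resampling scheme of the WE references cited above, which typically split and merge individual walkers; instead it assumes a simplified variant in which each occupied macrostate is redrawn in proportion to walker weight and the macrostate's total weight is shared equally among the new walkers. The names `resample_bins`, `bin_of`, and `target_per_bin` are hypothetical and chosen only for this illustration.

```python
import random
from collections import defaultdict


def resample_bins(walkers, bin_of, target_per_bin, rng=None):
    """Resample weighted walkers so every occupied bin holds a fixed count.

    walkers        -- list of (state, weight) pairs
    bin_of         -- function mapping a state to its macrostate label
                      (hypothetical; any hashable label works)
    target_per_bin -- number of walkers to keep in each occupied bin
    Each bin's total weight is preserved, so the total weight is too.
    """
    rng = rng or random.Random()

    # Group walkers by the macrostate (bin) they currently occupy.
    bins = defaultdict(list)
    for state, weight in walkers:
        bins[bin_of(state)].append((state, weight))

    resampled = []
    for members in bins.values():
        states = [s for s, _ in members]
        weights = [w for _, w in members]
        bin_weight = sum(weights)
        # Draw with replacement, proportionally to weight, then give
        # every new walker an equal share of the bin's weight.
        chosen = rng.choices(states, weights=weights, k=target_per_bin)
        resampled.extend((s, bin_weight / target_per_bin) for s in chosen)
    return resampled


# Toy example: states are points on a line, macrostates are unit intervals.
walkers = [(0.1, 0.4), (0.2, 0.1), (1.5, 0.3), (1.7, 0.1), (1.9, 0.1)]
resampled = resample_bins(
    walkers, bin_of=lambda x: int(x), target_per_bin=3, rng=random.Random(0)
)
```

In this toy example both unit intervals end up with three walkers each, and the total weight remains 1. A poorly chosen partition would spend the same number of walkers on regions that contribute little to the kinetics, which is one way to see why the efficiency of WE depends on the macrostate definition.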
A number of studies have addressed the automatic
discovery of macrostates. Chodera proposed to find
clusters for MSM using Perron cluster cluster analysis
(PCCA) and Monte Carlo simulated annealing (MCSA)
[12]. PCCA has low time complexity but is subject to
statistical errors. MCSA takes the PCCA result as its initial
solution and continues to reduce the statistical errors. The
advantage of MCSA is that it is guaranteed to find the
optimal clusters, but its time complexity is high and
unpredictable, especially for the sequential version
demonstrated in [12]. Chang and Yao proposed to reduce
the statistical errors of PCCA by clustering densely and
sparsely sampled conformational spaces separately, so
that the sparsely sampled space does not affect the accuracy of
the overall clustering [13], [14]. However, a disadvantage
of PCCA is that it only finds free energy wells; it is not
suitable for other clustering criteria. This article
proposes two greedy algorithms that generate clusters
optimizing any predefined target function. The greedy
302 ©2014 Engineering and Technology Publishing
Lecture Notes on Information Theory Vol. 2, No. 4, December 2014
doi: 10.12720/lnit.2.4.302-309