Greedy Algorithms for Fast Discovery of Macrostates from Molecular Dynamics Simulations

Haoyun Feng
Department of Computer Science and Engineering, Notre Dame, US
Email: hfeng@nd.edu

Geoffrey Siwo
Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, US
Email: siwomolbio@gmail.com

Jesus A. Izaguirre, Douglas Thain, and Badi Abdul-Wahid
Department of Computer Science and Engineering, Notre Dame, US
Email: {izaguirr, dthain, cabdulwa}@nd.edu

Abstract—With the development of distributed computing systems, it is possible to significantly accelerate long-term molecular dynamics simulations by using ensemble algorithms such as Markov State Models (MSM) and Weighted Ensemble (WE). Decomposing the conformational space of a molecule into macrostates is an important step in both methods. To ensure the efficiency and accuracy of ensemble methods, the macrostates must be defined according to certain kinetic properties. Monte Carlo simulated annealing (MCSA) has been widely applied to define macrostates that optimize the metastability of the dynamical system. This article proposes two greedy algorithms, G1 and G2, based on different definitions of the local search space, to improve on the efficiency and scalability of MCSA on distributed computing systems. Numerical experiments are conducted on two biological systems, alanine dipeptide and the WW domain. The experiments demonstrate that G1 is the most efficient of the three algorithms on both a single-core machine and a distributed computing system. The sequential version of G2 is the slowest, but G2 gains the largest speedup on distributed computing systems.

Index Terms—molecular dynamics, metastable states, unsupervised clustering, greedy algorithm.

I. INTRODUCTION

Molecular dynamics simulation [1] is a numerical tool for simulating the movements of molecules. It generates a sequence of coordinates representing the Brownian motion of the molecules, which is called a trajectory.
Biological studies such as drug design usually require long-term (millisecond) simulations that may cost several years of CPU time. However, with the development of distributed computing systems, it is possible to generate a large number of short trajectories in parallel. Recovering long-term dynamics from a large number of short trajectories has drawn the attention of many researchers, and a number of techniques have been developed for this purpose. The two most popular techniques are Markov State Models (MSM) [2]-[6] and Weighted Ensemble (WE) [2], [7]-[11].

Manuscript received October 12, 2014; revised January 5, 2015.

Partitioning the continuous conformational space into macrostates plays a key role in both techniques. MSM calculates the implied time scales of the dynamics from the transition matrix, in which each entry represents the transition probability between two macrostates. The quality of the macrostate definitions therefore affects the accuracy of MSM. WE, on the other hand, is a distributed algorithm that provides efficient sampling of the conformational space and reliable estimation of the reaction rate between free energy wells. Macrostates are defined so that WE samples the conformational space evenly by maintaining a constant number of samples in every macrostate. The efficiency of WE therefore depends on the underlying partition.

A number of studies have addressed the automatic discovery of macrostates. Chodera proposed to find clusters for MSM using Perron cluster cluster analysis (PCCA) and Monte Carlo simulated annealing (MCSA) [12]. PCCA has low time complexity but suffers from statistical errors. MCSA takes the PCCA result as its initial solution and continues to reduce the statistical errors. The advantage of MCSA is that it guarantees finding the optimal clusters, but its time complexity is high and unpredictable, especially for the sequential version demonstrated in [12].
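As a rough illustration of the quantities discussed above, the following Python sketch estimates a macrostate transition matrix from a discretized trajectory and scores a partition by its metastability. It assumes that metastability is measured as the trace of the macrostate transition matrix, a common choice in the PCCA literature; this definition is an assumption here and is not stated in the text above. The function names, the toy trajectory, and the two example partitions are hypothetical and serve only to show how a better partition yields a higher score.

```python
from collections import Counter


def transition_matrix(states, n_states, lag=1):
    """Estimate a row-normalized transition matrix from one discrete trajectory."""
    # Count observed transitions i -> j separated by `lag` steps.
    counts = Counter(zip(states[:-lag], states[lag:]))
    matrix = []
    for i in range(n_states):
        row = [counts[(i, j)] for j in range(n_states)]
        total = sum(row)
        # States with no outgoing transitions get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def lump(states, assignment):
    """Map each microstate index to its macrostate (assignment[micro] = macro)."""
    return [assignment[s] for s in states]


def metastability(matrix):
    """Assumed score: sum of self-transition probabilities (trace)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Hypothetical toy trajectory over 4 microstates: the system lingers in
# {0, 1}, moves to {2, 3}, and then returns.
micro_traj = [0, 1, 0, 1, 2, 3, 2, 3, 2, 0, 1, 0]

good_partition = [0, 0, 1, 1]  # groups microstates that interconvert quickly
bad_partition = [0, 1, 0, 1]   # splits them across macrostates

q_good = metastability(transition_matrix(lump(micro_traj, good_partition), 2))
q_bad = metastability(transition_matrix(lump(micro_traj, bad_partition), 2))
print(q_good, q_bad)
```

Under this assumed score, a search procedure such as MCSA would propose reassignments of microstates to macrostates and prefer those that increase the returned value; the sketch covers only the scoring step, not the search itself.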
Chang and Yao proposed to reduce the statistical errors of PCCA by clustering the densely and sparsely sampled regions of conformational space separately, so that the sparsely sampled regions do not affect the accuracy of the overall clustering [13], [14]. However, a disadvantage of PCCA is that it only finds free energy wells; it is not suitable for any other clustering criterion.

©2014 Engineering and Technology Publishing. Lecture Notes on Information Theory Vol. 2, No. 4, December 2014. doi: 10.12720/lnit.2.4.302-309

This article proposes two greedy algorithms that generate clusters optimizing any predefined target function. The greedy