40 PROC. OF THE 20th PYTHON IN SCIENCE CONF. (SCIPY 2021)
MPI-parallel Molecular Dynamics Trajectory Analysis
with the H5MD Format in the MDAnalysis Python
Package
Edis Jakupovic‡, Oliver Beckstein‡*
Abstract—Molecular dynamics (MD) computer simulations help elucidate de-
tails of the molecular processes in complex biological systems, from protein
dynamics to drug discovery. One major issue is that MD simulation files are
now commonly terabytes in size, which makes analyzing the data in these files
prohibitively expensive. In the age of national
supercomputers, methods of parallel analysis are becoming a necessity for the
efficient use of time and high performance computing (HPC) resources, but for
any approach to parallel analysis, simply reading the file from disk becomes
the performance bottleneck that limits overall analysis speed. One promising
way around this file I/O hurdle is to use a parallel message passing interface
(MPI) implementation with the HDF5 (Hierarchical Data Format 5) file format to
access a single file simultaneously with numerous processes on a parallel file
system. Our previous feasibility study suggested that this combination can lead
to favorable parallel scaling with hundreds of CPU cores, so we implemented a
fast and user-friendly HDF5 reader (the H5MDReader class) that adheres to
H5MD (HDF5 for Molecular Dynamics) specifications. We made H5MDReader
(together with an H5MD output class H5MDWriter) available in the MDAnalysis
library, a Python package that simplifies the process of reading and writing
various popular MD file formats by providing a streamlined user interface that
is independent of any specific file format. We benchmarked H5MDReader’s parallel
file reading capabilities on three HPC clusters: ASU Agave, SDSC Comet, and
PSC Bridges. The benchmark consisted of a simple split-apply-combine scheme
of an I/O bound task that split a 90k frame (113 GiB) coordinate trajectory
into N chunks for N processes, where each process performed the commonly
used RMSD (root mean square distance after optimal structural superposition)
calculation on their chunk of data, and then gathered the results back to the
root process. For baseline performance, we found maximum I/O speedups at 2
full nodes, with Agave showing 20x, and a maximum computation speedup on
Comet of 373x on 384 cores (all three HPCs scaled well in their computation
task). We went on to test a series of optimizations attempting to speed up
I/O performance, including adjusting file system stripe count, implementing a
masked array feature that only loads relevant data for the computation task,
front loading all I/O by loading the entire trajectory into memory, and manually
adjusting the HDF5 dataset chunk shapes. We found the largest improvement in
I/O performance by optimizing the chunk shape of the HDF5 datasets to match
the iterative access pattern of our analysis benchmark. With respect to baseline
serial performance, our best result was a 98x speedup at 112 cores on ASU
Agave. In terms of absolute time saved, the analysis went from 4623 seconds
in the baseline serial run to 47 seconds in the parallel, properly chunked run.
Our results emphasize that file I/O performance depends not only on the access
pattern of the file, but even more on the synergy between the access pattern
and the layout of the file on disk.
‡ Arizona State University
* Corresponding author: obeckste@asu.edu
Copyright © 2021 Edis Jakupovic et al. This is an open-access article dis-
tributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, pro-
vided the original author and source are credited.
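The split-apply-combine benchmark described above can be sketched as follows.
This is a minimal serial stand-in on toy data, using a plain RMSD without the
optimal structural superposition that the actual benchmark performs; in the
real benchmark, each chunk is read and processed by one MPI process and the
per-chunk results are gathered back on the root process.

```python
import numpy as np

def rmsd(frame, ref):
    # Plain RMSD between two (n_atoms, 3) coordinate arrays.
    # (The benchmark additionally superimposes each frame onto the
    # reference before computing the RMSD.)
    return np.sqrt(np.mean(np.sum((frame - ref) ** 2, axis=1)))

rng = np.random.default_rng(0)
trajectory = rng.standard_normal((90, 10, 3))  # toy stand-in for 90k frames
ref = trajectory[0]

n_chunks = 4  # stand-in for N MPI processes
# split: assign a contiguous block of frame indices to each "process"
chunks = np.array_split(np.arange(len(trajectory)), n_chunks)
# apply: each "process" computes RMSD over its own chunk of frames
partial = [np.array([rmsd(trajectory[i], ref) for i in idx])
           for idx in chunks]
# combine: gather the per-chunk results back together (MPI gather)
results = np.concatenate(partial)
```

The first frame is its own reference, so `results[0]` is exactly zero, and the
combined array has one RMSD value per trajectory frame.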
Index Terms—Molecular Dynamics Simulations, High Performance Computing,
Python, MDAnalysis, HDF5, H5MD, MPI I/O
Introduction
The molecular dynamics (MD) simulation approach [HBD+19]
is widely used across the biomolecular and materials sciences,
accounting for more than one quarter of the total computing time
[FQC+19] in the Extreme Science and Engineering Discovery
Environment (XSEDE) network of national supercomputers in
the US [TCD+14]. MD simulations, especially in the realm of
studying protein dynamics, serve an important purpose in charac-
terizing the dynamics, and ultimately the function of a protein
[Oro14]. For example, recent award-winning work [CDG+21]
involving the SARS-CoV-2 spike protein was able to use all-
atom MD simulations to elucidate the dynamics of the virus-to-
human cell interaction that was inaccessible to experiment. While
the parameters involved in fine tuning the physics driving these
simulations continue to improve, the computational demand of
longer, more accurate simulations increases [DDG+12]. As high
performance computing (HPC) resources continue to improve in
performance, MD simulation files are now commonly terabytes in
size, making serial analysis of these trajectory files
impractical [CR15]. Parallel analysis is a necessity for the effi-
cient use of both HPC resources and a scientist’s time [BFJ18],
[FQC+19]. MD trajectory analysis can be parallelized using task-
based or MPI-based (message passing interface) approaches, each
with their own advantages and disadvantages [PLK+18]. Here
we investigate parallel trajectory analysis with the MDAnalysis
Python library [MADWB11], [GLB+16]. MDAnalysis is a widely
used package in the molecular simulation community that can
read and write over 25 popular MD trajectory file formats while
providing a common object-oriented interface that makes data
available as numpy arrays [HMvdW+20]. MDAnalysis aims to
bridge the entrenched user communities of different MD packages,
allowing scientists to more easily (and productively) move
between these communities. Previous work that focused
on developing a task-based approach to parallel analysis found
that an I/O bound task only scaled to 12 cores due to a file I/O
bottleneck [SFMLIP+19]. Our recent feasibility study suggested
that parallel reading via MPI-IO and the HDF5 file format could