PROC. OF THE 20th PYTHON IN SCIENCE CONF. (SCIPY 2021)

MPI-parallel Molecular Dynamics Trajectory Analysis with the H5MD Format in the MDAnalysis Python Package

Edis Jakupovic‡, Oliver Beckstein‡*

Abstract—Molecular dynamics (MD) computer simulations help elucidate details of the molecular processes in complex biological systems, from protein dynamics to drug discovery. One major issue is that MD simulation files are now commonly terabytes in size, which makes analyzing the data in them prohibitively expensive when done serially. In the age of national supercomputers, methods of parallel analysis are becoming a necessity for the efficient use of both time and high performance computing (HPC) resources, but for any approach to parallel analysis, simply reading the file from disk becomes the performance bottleneck that limits overall analysis speed. One promising way around this file I/O hurdle is to use a parallel message passing interface (MPI) implementation together with the HDF5 (Hierarchical Data Format 5) file format to access a single file simultaneously with numerous processes on a parallel file system. Our previous feasibility study suggested that this combination can lead to favorable parallel scaling with hundreds of CPU cores, so we implemented a fast and user-friendly HDF5 reader (the H5MDReader class) that adheres to the H5MD (HDF5 for Molecular Dynamics) specification. We made H5MDReader (together with an H5MD output class, H5MDWriter) available in the MDAnalysis library, a Python package that simplifies reading and writing various popular MD file formats by providing a streamlined user interface that is independent of any specific file format. We benchmarked H5MDReader’s parallel file reading capabilities on three HPC clusters: ASU Agave, SDSC Comet, and PSC Bridges.
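For many processes to read a single trajectory simultaneously, each MPI rank must first know which contiguous block of frames it owns. A minimal sketch of such a frame partition is shown below; the helper name `partition_frames` and the H5MD group name `trajectory` in the comment are our own illustrative choices, not MDAnalysis or H5MD-mandated names (opening an HDF5 file collectively with `h5py`'s `mpio` driver additionally requires an MPI-enabled HDF5 build and `mpi4py`).

```python
def partition_frames(n_frames, n_ranks, rank):
    """Return the [start, stop) frame range owned by `rank` when
    `n_frames` frames are split as evenly as possible over `n_ranks`."""
    base, extra = divmod(n_frames, n_ranks)
    # the first `extra` ranks each take one additional frame
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# Each rank would then read only its own slice, e.g. (illustrative):
#   f = h5py.File("traj.h5md", "r", driver="mpio", comm=MPI.COMM_WORLD)
#   positions = f["/particles/trajectory/position/value"][start:stop]

if __name__ == "__main__":
    # 90000 frames over 112 ranks: the ranges tile the trajectory exactly
    ranges = [partition_frames(90000, 112, r) for r in range(112)]
    print(ranges[0], ranges[-1])
```

Because the slices are contiguous and non-overlapping, no two ranks ever contend for the same frames, which is what makes the independent per-rank reads in the benchmark possible.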
The benchmark consisted of a simple split-apply-combine scheme applied to an I/O-bound task: a 90k-frame (113 GiB) coordinate trajectory was split into N chunks for N processes, each process performed the commonly used RMSD (root mean square distance after optimal structural superposition) calculation on its chunk of the data, and the results were then gathered back to the root process. For baseline performance, we found maximum I/O speedups at 2 full nodes, with Agave showing 20x, and a maximum computation speedup on Comet of 373x on 384 cores (all three HPC clusters scaled well in the computation task). We went on to test a series of optimizations aimed at speeding up I/O performance: adjusting the file system stripe count, implementing a masked-array feature that only loads the data relevant to the computation task, front-loading all I/O by reading the entire trajectory into memory, and manually adjusting the HDF5 dataset chunk shapes. We found the largest improvement in I/O performance by optimizing the chunk shape of the HDF5 datasets to match the iterative access pattern of our analysis benchmark. With respect to baseline serial performance, our best result was a 98x speedup at 112 cores on ASU Agave; in terms of absolute time saved, the analysis went from 4623 seconds in the baseline serial run to 47 seconds in the parallel, properly chunked run. Our results emphasize that file I/O performance depends not just on the access pattern of the file, but more so on the synergy between the access pattern and the layout of the file on disk.

‡ Arizona State University
* Corresponding author: obeckste@asu.edu

Copyright © 2021 Edis Jakupovic et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
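The per-frame "apply" step of the benchmark, RMSD after optimal structural superposition, can be sketched in plain NumPy using the Kabsch algorithm. This is an independent illustration of the calculation, not the MDAnalysis implementation (MDAnalysis uses the quaternion-based QCP method); in the full benchmark each rank would evaluate this for every frame in its chunk and the per-frame results would be gathered to the root process (e.g. with `mpi4py`'s `comm.gather`).

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """RMSD between (N, 3) coordinate arrays P and Q after removing
    translation and applying the optimal rotation (Kabsch algorithm)."""
    P = P - P.mean(axis=0)              # center both coordinate sets
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                         # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    # correct for a possible reflection so R is a proper rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P @ R.T - Q                  # rotated P vs. reference Q
    return np.sqrt((diff ** 2).sum() / len(P))
```

Applied to a frame that is merely a rotated and translated copy of the reference, this returns an RMSD of (numerically) zero, which is a convenient correctness check for the superposition step.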
Index Terms—Molecular Dynamics Simulations, High Performance Computing, Python, MDAnalysis, HDF5, H5MD, MPI I/O

Introduction

The molecular dynamics (MD) simulation approach [HBD+19] is widely used across the biomolecular and materials sciences, accounting for more than one quarter of the total computing time [FQC+19] in the Extreme Science and Engineering Discovery Environment (XSEDE) network of national supercomputers in the US [TCD+14]. MD simulations, especially in the realm of studying protein dynamics, serve an important purpose in characterizing the dynamics, and ultimately the function, of a protein [Oro14]. For example, recent award-winning work [CDG+21] involving the SARS-CoV-2 spike protein used all-atom MD simulations to elucidate dynamics of the virus-to-human cell interaction that were inaccessible to experiment. While the parameters involved in fine-tuning the physics driving these simulations continue to improve, the computational demand of longer, more accurate simulations increases [DDG+12]. As high performance computing (HPC) resources continue to improve in performance, MD trajectory files are now commonly terabytes in size, making serial analysis of these files impractical [CR15]. Parallel analysis is a necessity for the efficient use of both HPC resources and a scientist’s time [BFJ18], [FQC+19]. MD trajectory analysis can be parallelized using task-based or MPI-based (message passing interface) approaches, each with their own advantages and disadvantages [PLK+18]. Here we investigate parallel trajectory analysis with the MDAnalysis Python library [MADWB11], [GLB+16]. MDAnalysis is a widely used package in the molecular simulation community that can read and write over 25 popular MD trajectory file formats while providing a common object-oriented interface that makes data available as NumPy arrays [HMvdW+20].
MDAnalysis aims to bridge the entrenched user communities of different MD packages, allowing scientists to more easily (and productively) move between these communities. Previous work that focused on developing a task-based approach to parallel analysis found that an I/O-bound task only scaled to 12 cores due to a file I/O bottleneck [SFMLIP+19]. Our recent feasibility study suggested that parallel reading via MPI-IO and the HDF5 file format could