Parallel analysis in the MDAnalysis Library: Benchmark of Trajectory File Formats

Mahzad Khoshlessan, Oliver Beckstein

February 24, 2017

MDAnalysis (http://mdanalysis.org) is a Python library to analyze molecular dynamics (MD) trajectories generated by all major MD simulation packages. MDAnalysis enables users to access the raw simulation data through a uniform object-oriented Python interface and to perform structural and temporal analysis of their simulations. Simulations are continuously increasing in size and length, so the amount of data to be analyzed is growing rapidly and analysis is increasingly becoming a bottleneck. Parallel approaches are needed to increase analysis throughput, but MDAnalysis does not yet provide a standard interface for parallel analysis; instead, various existing parallel libraries are currently used to parallelize MDAnalysis-based code. In this work, we describe a benchmark suite that can be used to evaluate performance for parallel map-reduce type analysis, and we use it to investigate the performance of MDAnalysis with the Dask library for task-graph based distributed computing (http://dask.pydata.org/). As the computational task we perform an optimal structural superposition of the atoms of a protein to a reference structure by minimizing the RMSD of the Cα atoms. A range of commonly used MD file formats (CHARMM/NAMD DCD, Gromacs XTC, Amber NetCDF) and different trajectory sizes are benchmarked on different high performance computing (HPC) resources, ranging from XSEDE supercomputers with SSD or Lustre storage to local heterogeneous workstations with a Gigabit-linked network file system or locally attached SSDs. The benchmarks show a strong dependence of the overall execution time on the file format and the hardware. DCD is the fastest format to read but only scales to moderate core numbers when the files are served from SSDs; in general, contention of parallel workers for the file prevents scaling on most hardware.
XTC appears overall as the most balanced format, with consistently strong scaling and efficiency across most resource configurations. Parallelization within a node (up to 24 processes) with the dask multiprocessing scheduler is generally beneficial, but parallelization across multiple nodes (with dask distributed) only shows weak gains, likely due to contention.
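The computational task used in the benchmark, an optimal superposition that minimizes the RMSD between a mobile coordinate set and a reference, can be illustrated with a minimal NumPy sketch of the Kabsch algorithm. This is an illustrative stand-in, not the MDAnalysis implementation itself (MDAnalysis ships its own optimized RMSD routines); the function name and array layout here are assumptions for the example.

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """Minimum RMSD between coordinate sets P and Q (each N x 3)
    after optimal translation and rotation (Kabsch algorithm)."""
    # Remove translation: center both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Covariance matrix between the two centered coordinate sets.
    H = P.T @ Q
    U, S, Vt = np.linalg.svd(H)
    # Correct for a possible reflection (det = -1 case).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    # Optimal rotation that maps P onto Q.
    R = Vt.T @ D @ U.T
    P_rot = P @ R.T
    return np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1)))
```

In the benchmark this quantity is evaluated for the Cα coordinates of every trajectory frame against the reference structure, so the per-frame cost is dominated by one small SVD plus the trajectory I/O that the paper measures.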
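The map-reduce pattern referred to above splits the trajectory into contiguous blocks of frames, analyzes each block independently, and concatenates the per-block results in frame order. The following sketch shows that split-apply-combine structure with a thread pool and a placeholder per-frame function; the paper instead dispatches the blocks through Dask's multiprocessing and distributed schedulers, and the helper names here are assumptions for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def make_blocks(n_frames, n_blocks):
    """Partition frame indices 0..n_frames-1 into contiguous slices."""
    size = math.ceil(n_frames / n_blocks)
    return [range(start, min(start + size, n_frames))
            for start in range(0, n_frames, size)]

def analyze_block(frames):
    # Placeholder per-frame analysis; in the benchmark each worker
    # opens the trajectory file and computes one RMSD per frame.
    return [float(f) for f in frames]

def map_reduce(n_frames, n_blocks):
    blocks = make_blocks(n_frames, n_blocks)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        partial = pool.map(analyze_block, blocks)   # map step
    results = []
    for block_result in partial:                    # reduce step:
        results.extend(block_result)                # concatenate in order
    return results
```

Because every worker reads from the same trajectory file, this layout is exactly where the file-format and storage dependence reported in the benchmarks arises: the map step is embarrassingly parallel, but the parallel reads contend for the shared file.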