Parallel analysis in the MDAnalysis Library: Benchmark of Trajectory File Formats

Mahzad Khoshlessan, Oliver Beckstein

February 24, 2017

MDAnalysis (http://mdanalysis.org) is a Python library to analyze molecular dynamics (MD) trajectories generated by all major MD simulation packages. MDAnalysis enables users to access the raw simulation data through a uniform object-oriented Python interface and to perform structural and temporal analysis of their simulations. Simulations are continuously increasing in size and length, so the amount of data to be analyzed is growing rapidly and analysis is increasingly becoming a bottleneck. Parallel approaches are needed to increase analysis throughput, but MDAnalysis does not yet provide a standard interface for parallel analysis; instead, various existing parallel libraries are currently used to parallelize MDAnalysis-based code. In this work, we describe a benchmark suite that can be used to evaluate performance for parallel map-reduce type analysis, and we use it to investigate the performance of MDAnalysis with the Dask library for task-graph based distributed computing (http://dask.pydata.org/). As the computational task we perform an optimal structural superposition of the atoms of a protein to a reference structure by minimizing the RMSD of the Cα atoms. A range of commonly used MD file formats (CHARMM/NAMD DCD, Gromacs XTC, Amber NetCDF) and different trajectory sizes are benchmarked on different high performance computing (HPC) resources, ranging from XSEDE supercomputers with SSD or Lustre storage to local heterogeneous workstations with a Gigabit-linked network file system or locally attached SSDs. The benchmarks show a strong dependence of the overall execution time on the file format and the hardware. DCD is the fastest format to read but only scales to moderate core numbers when the files are served from SSDs; in general, contention of parallel workers for the file prevents scaling on most hardware.
XTC appears overall as the most balanced format, with consistently strong scaling and efficiency across most resource configurations. Parallelization within a node (up to 24 processes) with the dask multiprocessing scheduler is generally beneficial, but parallelization across multiple nodes (with dask distributed) only shows weak gains, likely due to contention.
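The computational task used in the benchmark, an optimal superposition that minimizes the RMSD between a mobile coordinate set and a reference, can be illustrated with a minimal NumPy sketch of the Kabsch algorithm. This is an illustrative stand-in, not the MDAnalysis implementation itself (MDAnalysis ships its own optimized RMSD routines); the function name and array layout here are assumptions for the example.

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """Minimum RMSD between coordinate sets P and Q (each N x 3)
    after optimal translation and rotation (Kabsch algorithm)."""
    # Remove translation: center both sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Covariance matrix between the two centered coordinate sets.
    H = P.T @ Q
    U, S, Vt = np.linalg.svd(H)
    # Correct for a possible reflection (det = -1 case).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    # Optimal rotation that maps P onto Q.
    R = Vt.T @ D @ U.T
    P_rot = P @ R.T
    return np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1)))
```

In the benchmark this quantity is evaluated for the Cα coordinates of every trajectory frame against the reference structure, so the per-frame cost is dominated by one small SVD plus the trajectory I/O that the paper measures.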
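The map-reduce pattern referred to above splits the trajectory into contiguous blocks of frames, analyzes each block independently, and concatenates the per-block results in frame order. The following sketch shows that split-apply-combine structure with a thread pool and a placeholder per-frame function; the paper instead dispatches the blocks through Dask's multiprocessing and distributed schedulers, and the helper names here are assumptions for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def make_blocks(n_frames, n_blocks):
    """Partition frame indices 0..n_frames-1 into contiguous slices."""
    size = math.ceil(n_frames / n_blocks)
    return [range(start, min(start + size, n_frames))
            for start in range(0, n_frames, size)]

def analyze_block(frames):
    # Placeholder per-frame analysis; in the benchmark each worker
    # opens the trajectory file and computes one RMSD per frame.
    return [float(f) for f in frames]

def map_reduce(n_frames, n_blocks):
    blocks = make_blocks(n_frames, n_blocks)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        partial = pool.map(analyze_block, blocks)   # map step
    results = []
    for block_result in partial:                    # reduce step:
        results.extend(block_result)                # concatenate in order
    return results
```

Because every worker reads from the same trajectory file, this layout is exactly where the file-format and storage dependence reported in the benchmarks arises: the map step is embarrassingly parallel, but the parallel reads contend for the shared file.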