Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms

Henri Casanova (a), Arnaud Giersch (b), Arnaud Legrand (c), Martin Quinson (d), Frédéric Suter (e,f,*)

(a) Dept. of Information and Computer Sciences, University of Hawai‘i at Manoa, U.S.A.
(b) FEMTO-ST, University of Franche-Comté, Belfort, France
(c) LIG, CNRS, Grenoble University, France
(d) LORIA, Université de Lorraine, France
(e) IN2P3 Computing Center, CNRS/IN2P3, Lyon-Villeurbanne, France
(f) LIP, INRIA, ENS Lyon, Lyon, France

(*) Corresponding author (frederic.suter@cc.in2p3.fr)
Preprint submitted to Journal of Parallel and Distributed Computing, July 2, 2014
hal-01017319, version 1 -

Abstract

The study of parallel and distributed applications and platforms, whether in the cluster, grid, peer-to-peer, volunteer, or cloud computing domain, often mandates empirical evaluation of proposed algorithmic and system solutions via simulation. Unlike direct experimentation via an application deployment on a real-world testbed, simulation enables fully repeatable and configurable experiments for arbitrary hypothetical scenarios. Two key concerns are accuracy (so that simulation results are scientifically sound) and scalability (so that simulation experiments can be fast and memory-efficient). While the scalability of a simulator is easily measured, the accuracy of many state-of-the-art simulators is largely unknown because they have not been sufficiently validated. In this work we describe recent accuracy and scalability advances made in the context of the SimGrid simulation framework. A design goal of SimGrid is that it should be versatile, i.e., applicable across all aforementioned domains. We present quantitative results that show that SimGrid compares favorably to state-of-the-art domain-specific simulators in terms of scalability, accuracy, or the trade-off between the two. An important implication is that, contrary to popular wisdom, striving for versatility in a simulator is not an impediment but instead is conducive to improving both accuracy and scalability.

Keywords: Simulation, validation, scalability, versatility, SimGrid

1. Introduction

The use of parallel and distributed computing platforms is pervasive in a wide range of contexts and for a wide range of applications. High Performance Computing (HPC) has been a consumer of and driver for these platforms. In particular, commodity clusters built from off-the-shelf computers interconnected with switches have been used for applications in virtually all fields of science and engineering, and exascale systems with millions of cores are already envisioned. Platforms that aggregate multiple clusters over wide-area networks, or grids, have received a lot of attention over the last decade with both specific software infrastructures and application deployments. Distributed applications and platforms have also come to prominence in the peer-to-peer and volunteer computing domains (e.g., for content sharing, scientific computing, data storage and retrieval, media streaming), enabled by the impressive capabilities of personal computers and high-speed personal Internet connections. Finally, cloud computing relies on the use of large-scale distributed platforms that host virtualized resources leased to consumers of compute cycles and storage space.

While large-scale production platforms have been deployed and used successfully in all these domains, many open questions remain. Relevant challenges include resource management, resource discovery and monitoring, application scheduling, data management, decentralized algorithms, electrical power management, resource economics, fault-tolerance, scalability, and performance. Regardless of the specific context and of the research question at hand, studying and understanding the behavior of applications on distributed platforms is difficult. The goal is to assess the quality of competing algorithmic and system designs with respect to precise objective metrics. Theoretical analysis is typically tractable only when using stringent and ultimately unrealistic assumptions. As a result, relevant research is mostly empirical and proceeds as follows. An experiment consists in executing a