Characterizing the Communication Demands of the Graph500 Benchmark on a Commodity Cluster

Pablo Fuentes, José Luis Bosque, Ramón Beivide
University of Cantabria, Santander, Spain
{pablo.fuentes, joseluis.bosque, ramon.beivide}@unican.es

Mateo Valero
Barcelona Supercomputing Center, Barcelona, Spain
mateo@bsc.es

Cyriel Minkenberg
IBM Zurich Research Laboratory, Zurich, Switzerland
sil@zrl.ibm.com

This is an earlier accepted version; the final version of this work can be found in the Proceedings of the 2014 IEEE/ACM International Symposium on Big Data Computing (BDC 2014) under DOI 10.1109/BDC.2014.16. Copyright belongs to IEEE.

Abstract—Big Data applications have gained importance over the last few years. Such applications focus on the analysis of huge amounts of unstructured information and present a series of differences with respect to traditional High Performance Computing (HPC) applications. To illustrate these dissimilarities, this paper analyzes the behavior of the most scalable version of the Graph500 benchmark when run on a state-of-the-art commodity cluster facility. Our work shows that this new computation paradigm stresses the interconnection subsystem. We provide both analytical and empirical characterizations of the Graph500 benchmark, showing that its communication needs bound the performance achieved on a cluster facility. To the best of our knowledge, our evaluation is the first to consider the impact of message aggregation on the communication overhead and to explore a tradeoff that diminishes benchmark execution time, increasing system performance.

Keywords—Graph500, cluster supercomputing platforms, communication characterization, message aggregation

I. INTRODUCTION

Over the last few decades there has been an exponential rise in the amount of data to be processed in multiple human activities. Furthermore, in parallel to this increase in data volume there has been a fast growth in its complexity.
Both factors create a need for higher computational capacity and introduce several algorithmic challenges. These challenges can be summarized as the need to discover patterns in the data, to create a framework for analyzing those patterns under time and resource restrictions, and to predict future behavior from those patterns. They arise in fields as diverse as social networks, medical informatics, banking and cybersecurity, among others. One example is the Facebook social network, whose number of monthly active users has grown exponentially in recent years, surpassing 1.2 billion as of late 2013. The core idea behind these Big Data problems is to extract knowledge from huge amounts of unstructured data, to ease analysis and decision making. A convenient model for this information is a graph, which allows data to be reorganized by searching for spanning trees embedded in the graph. The size of these Big Data applications demands large amounts of memory and computing capability. State-of-the-art computing server performance does not meet such requirements and does not scale at their growth pace. The only realistic option is to run such applications in parallel, partitioning the graph and its associated computation across several processes. Parallel computers have been used in other computing fields for decades, but they are not optimized for this new set of problems. In this context, the Graph500 [1] organization arose to gather international High Performance Computing (HPC) experts from industry and academia, with the aim of determining the capacity of current computing systems to run graph-based applications. Its main contribution is a large-scale benchmark that performs a typical concurrent tree search algorithm called Breadth-First Search (BFS). Graph500 proposes a new performance metric: the number of Traversed Edges Per Second (TEPS).
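As a concrete illustration of the BFS computation and the TEPS metric just introduced, the following minimal sketch builds a parent tree from a toy adjacency list and derives a TEPS figure. The graph, the function names, and the timing are hypothetical, not part of the benchmark code; note also that this sketch counts every edge examination, whereas the official benchmark derives its edge count from the input edge list.

```python
from collections import deque

def bfs_parent_tree(adj, root):
    """Level-synchronous BFS sketch: returns the parent of each reached
    vertex and the number of edges examined during the traversal."""
    parent = {root: root}
    frontier = deque([root])
    edges_traversed = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            edges_traversed += 1          # each adjacency entry examined
            if w not in parent:
                parent[w] = v             # first visit: record the ancestor
                frontier.append(w)
    return parent, edges_traversed

# Toy undirected graph as an adjacency list (hypothetical example).
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
parent, edges = bfs_parent_tree(adj, root=0)

# TEPS = edges traversed / kernel execution time (here an assumed 1 ms).
teps = edges / 0.001
```

In the real benchmark the traversal is distributed across processes, so each frontier expansion triggers remote lookups; this is precisely the communication pattern the paper characterizes.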
TEPS is calculated by dividing the number of edges traversed in the graph by the execution time of kernel 2. The Graph500 benchmark consists of three kernels working with large graphs:
1) Construction of a graph from a random edge list (kernel 1).
2) Computation of the ancestor tree for a randomly sampled search key through a BFS algorithm (kernel 2).
3) Validation of the parent tree from kernel 2, through an assertion of required properties.
Due to the extensive and fast-rising relevance of Big Data applications, it is essential to determine the existence of any possible performance losses and to identify their sources. In this regard, a comprehensive study of the specific algorithm behavior can be indispensable to locate and minimize inefficiencies in the code execution, and to optimize resource usage. The variety of these applications makes it unfeasible to repeat such scrutiny for each of them. The Graph500 benchmark can be a good starting point, as it represents a subset of graph-based, data-intensive applications. A more ambitious, long-