Turbine: A distributed-memory dataflow engine for extreme-scale many-task applications

Justin M. Wozniak, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, wozniak@mcs.anl.gov
Timothy G. Armstrong, Computer Science Department, University of Chicago, Chicago, IL, USA, tga@uchicago.edu
Ketan Maheshwari, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, ketan@mcs.anl.gov
Ewing L. Lusk, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, lusk@mcs.anl.gov
Daniel S. Katz, Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, IL, USA, d.katz@ieee.org
Michael Wilde, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, wilde@mcs.anl.gov
Ian T. Foster, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, foster@mcs.anl.gov

ABSTRACT

Efficiently utilizing the rapidly increasing concurrency of multi-petaflop computing systems is a significant programming challenge. One approach is to structure applications with an upper layer of many loosely coupled, coarse-grained tasks, each comprising a tightly coupled parallel function or program. "Many-task" programming models such as functional parallel dataflow may be used at the upper layer to generate massive numbers of tasks, each of which generates significant tightly coupled parallelism at the lower level via multithreading, message passing, and/or partitioned global address spaces. At large scales, however, the management of task distribution, data dependencies, and inter-task data movement is a significant performance challenge. In this work, we describe Turbine, a new highly scalable and distributed many-task dataflow engine. Turbine executes a generalized many-task intermediate representation with automated self-distribution and is scalable to multi-petaflop infrastructures.
We present here the architecture of Turbine and its performance on highly concurrent systems.

Categories and Subject Descriptors

D.3.3 [Programming Languages]: Concurrent programming structures

General Terms

Languages

Keywords

MPI, ADLB, Swift, Turbine, exascale, concurrency, dataflow

SWEET 2012, May 20-25, Scottsdale, AZ, USA. Copyright 2012 ACM 978-1-4503-1876-1/12/05 ...$15.00.

1. INTRODUCTION

Developing programming solutions that help applications utilize the high concurrency of multi-petaflop computing systems is a challenge. Languages such as Dryad, Swift, and Skywriting provide a promising direction. Their implicitly parallel dataflow semantics allow the high-level logic of large-scale applications to be expressed in a manageable way while exposing massive parallelism through many-task programming. Current implementations of these languages, however, limit the evaluation of the dataflow program to a single-node computer, with the resultant tasks distributed to other nodes for execution.

We propose here a model for distributed-memory evaluation of dataflow programs that spreads the overhead of program evaluation and task generation throughout an extreme-scale computing system. This execution model enables function and expression evaluation to take place on any node of the system. It breaks parallel loops and concurrent function invocations into fragments for distributed execution.
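To make the many-task dataflow idea concrete, the following sketch (in Python, purely for illustration; it is not the Turbine API, and the function names `simulate` and `summarize` are hypothetical) shows how a parallel loop can generate many independent coarse-grained tasks whose results then flow, via data dependencies, into a dependent task:

```python
# Illustrative sketch only -- not Turbine's actual interface.
from concurrent.futures import ThreadPoolExecutor

def simulate(i):
    # Stand-in for a coarse-grained task (e.g., a parallel program run).
    return i * i

def summarize(results):
    # Dependent task: can only run once all of its inputs are available.
    return sum(results)

with ThreadPoolExecutor(max_workers=4) as pool:
    # The "parallel loop": each iteration becomes an independent task,
    # analogous to the loop fragments a dataflow engine would distribute.
    futures = [pool.submit(simulate, i) for i in range(10)]
    # Dataflow dependency: summarize consumes the loop's outputs.
    total = summarize(f.result() for f in futures)

print(total)  # 285
```

In a dataflow language, the dependency between the loop and the summary is implicit in the data references; a distributed engine additionally spreads the loop evaluation itself across nodes rather than running it on a single evaluator.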
The primary novel features of our workflow engine, namely its use of distributed memory and message passing, enable the scalability and task generation rates needed to efficiently utilize future systems. We also describe our implementation of this model in Turbine. This paper demonstrates that Turbine can execute Swift programs on large-scale, high-performance computing (HPC) systems.