Heterogeneous Streaming

Chris J. Newburn, Gaurav Bansal, Michael Wood, Luis Crivelli, Judit Planas, Alejandro Duran
chris.newburn@intel.com, gaurav2.bansal@intel.com, michael.wood@3DS.com, luis.crivelli@3DS.com, judit.planas@epfl.ch, alejandro.duran@intel.com

Paulo Souza, Leonardo Borges, Piotr Luszczek, Stanimire Tomov, Jack Dongarra, Hartwig Anzt
prps@petrobras.com.br, leonardo.borges@intel.com, luszczek@eecs.utk.edu, tomov@cs.utk.edu, dongarra@cs.utk.edu, hanzt@icl.utk.edu

Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Ichitaro Yamazaki, Jesus Labarta
mgates3@utk.edu, haidar@eecs.utk.edu, yjia@utk.edu, kkabir@eecs.utk.edu, iyamazak@utk.edu, jesus.labarta@bsc.es

Abstract—This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams*, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.

I. PROGRAMMING IN A HETEROGENEOUS ENVIRONMENT

Effective concurrency among tasks is difficult to achieve, particularly on heterogeneous platforms. If this effort were more tractable, more people would tune their codes to achieve efficient performance.
Our proposed hStreams framework makes it easier to port and tune task-parallel codes by offering the following features:

Separation of concerns: The hStreams interface addresses real-world programmer productivity by separating two concerns: 1) the expression of functional semantics and the exposure of task concurrency, and 2) performance tuning and control over how tasks are mapped to a platform. Creators of scientific algorithms who want to harness computing resources are generally not computer scientists; they want something simple and intuitive. Code tuners and runtime developers may work long after the original scientific developers have moved on from their creations, and they tend to want the freedom to control how code executes without acquiring application domain expertise.

Sequential semantics: Many users find a valid sequence of task invocations more natural to express than a dependence graph of concurrent tasks. Our hStreams library offers a sequential FIFO stream abstraction to make concurrency more tractable and easier to debug.

Task concurrency: The concurrency that we focus on in this work is among tasks. It is orthogonal to issues like code scheduling, vectorization, and threading, all of which are important optimizations to apply within tasks.

Pipeline parallelism: Platforms with distributed resources tend to have significant communication latency and constrained bandwidth, and the resulting transfer time needs to be overlapped with computation; pipelining the transfer of one tile of data while computing on another tile is often critical to performance.

Unified interface to heterogeneous platforms: When frameworks like OpenMP require users to handle task execution differently on local and remote resources, they increase the burden on programmers. hStreams, in contrast, offers a uniform task interface to ease that burden.

* Other brands and names are the property of their respective owners.
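The benefit of the pipeline parallelism described above can be illustrated with a small timing model. This is only a sketch and does not use the hStreams API: the function names, the fixed per-tile transfer and compute times, and the assumption that there is enough buffer space for transfers to run ahead of computation are all hypothetical choices made for illustration.

```python
def serial_time(n_tiles, t_xfer, t_comp):
    """Total time when each tile is transferred and then computed, with no overlap."""
    return n_tiles * (t_xfer + t_comp)


def pipelined_time(n_tiles, t_xfer, t_comp):
    """Total time when the transfer of the next tile overlaps the current compute.

    Models two FIFO resources: one transfer link and one compute unit.
    Assumption: buffer space is sufficient for transfers to run ahead.
    """
    xfer_done = 0.0  # time at which the link finishes its latest transfer
    comp_done = 0.0  # time at which the compute unit finishes its latest tile
    for _ in range(n_tiles):
        xfer_done += t_xfer                # the link moves tiles back to back
        start = max(xfer_done, comp_done)  # wait for the data and for the compute unit
        comp_done = start + t_comp
    return comp_done


if __name__ == "__main__":
    n, t, c = 4, 2.0, 3.0
    print("serial   :", serial_time(n, t, c))
    print("pipelined:", pipelined_time(n, t, c))
```

With four tiles, a transfer time of 2 units, and a compute time of 3 units, the model gives 20 time units without overlap and 14 with it; only the first transfer is left exposed, since compute is the slower stage.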
The contributions of this paper are as follows: 1) the hStreams library framework, and a demonstration of its applicability to heterogeneous platforms; 2) a comparison with other programming models and language interfaces; 3) a description of our approach to layering hStreams above other plumbing layers and below other interfaces in commercial codes and academic frameworks, with minimal overheads; 4) an evaluation of hStreams on relevant kernels and applications, for different platforms, and in comparison with NVIDIA CUDA Streams.

II. THE HSTREAMS LIBRARY

The hStreams [1] library provides a streaming, task-queue abstraction for heterogeneous platforms, similar to CUDA Streams® [2] and OpenCL. The hStreams library focuses on heterogeneous portability. It is now open source on GitHub; see https://github.com/01org/hetero-streams/wiki. It was previously distributed with the Intel® Many-core Platform Software Stack, versions 3.4 to 3.6.

Features

The hStreams library manages task concurrency across one or more units of heterogeneous computing resources that we call domains. It uses queues called streams to localize the dependence scope and to offer a FIFO (first-in, first-out) semantic. Memory is managed across domains using an abstraction called buffers, which are used to help manage properties and track dependences. These three building blocks offer abstractions that enhance programmer productivity, provide transparency and control, and enable a separation of concerns between the scientist programmer and the one tuning for a target architecture.

A domain is a set of computing and storage resources that share coherent memory and have some degree of locality. Examples of domains include a host CPU, a Knights family coprocessor card, a node in a cluster reached across the fabric, a GPU, and a subset of cores that share a memory controller. Domains are discoverable and enumerable to users.
Each domain has a set of properties that include the number, kind and speed of hardware threads, and the amount of each kind of memory, e.g. high-bandwidth memory.
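To make the three building blocks concrete, the following is a toy model of domains, streams, and buffers. It is a sketch under stated assumptions and not the hStreams API: the class names `Domain`, `Buffer`, and `Stream`, the `discover_domains` function, and all property values are hypothetical. The `drain` method runs actions one at a time, so it models only the FIFO ordering within a stream, not concurrent execution.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical record of a domain's discoverable properties."""
    name: str
    hw_threads: int
    memory_gib: dict  # memory kind -> capacity in GiB


@dataclass
class Buffer:
    """Hypothetical buffer: one logical array instantiated in several domains."""
    name: str
    domains: set = field(default_factory=set)


class Stream:
    """Hypothetical FIFO queue of actions bound to one domain."""

    def __init__(self, domain):
        self.domain = domain
        self._queue = deque()

    def enqueue(self, label, action):
        self._queue.append((label, action))

    def drain(self):
        """Run queued actions in first-in, first-out order; return their labels."""
        done = []
        while self._queue:
            label, action = self._queue.popleft()
            action()
            done.append(label)
        return done


def discover_domains():
    """Stand-in for platform discovery; the values are made up."""
    return [
        Domain("host", hw_threads=32, memory_gib={"ddr": 64}),
        Domain("card0", hw_threads=240, memory_gib={"hbm": 16}),
    ]


if __name__ == "__main__":
    domains = discover_domains()
    # Pick the domain with the most high-bandwidth memory.
    target = max(domains, key=lambda d: d.memory_gib.get("hbm", 0))
    tile = Buffer("A_tile", domains={"host", target.name})
    log = []
    stream = Stream(target)
    stream.enqueue("xfer A_tile", lambda: log.append("xfer"))
    stream.enqueue("compute on A_tile", lambda: log.append("compute"))
    print(target.name, stream.drain(), sorted(tile.domains))
```

The point of the sketch is the division of labor: the code that enqueues actions states only a valid order, while the choice of target domain is made separately from enumerated properties, which is where a tuner could intervene.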