Poster Abstract: PUSH, a Dataflow Shell

Noah Evans
Alcatel-Lucent Bell Labs
Antwerp, Belgium
npe@plan9.cs.bell-labs.com

Eric Van Hensbergen
IBM Research
Austin, TX
bergevan@us.ibm.com

1. INTRODUCTION

The deluge of huge data sets, such as those provided by sensor networks, online transactions, and the web, provides exciting opportunities for data analysis. The scale of the data makes it impossible to process in a reasonable amount of time on isolated machines. This has led to dataflow systems emerging as the standard tool for solving research problems using these vast datasets. In typical dataflow systems, runtimes like Dryad [3] and Streamline [1] define graphs of processes, the edges of the graphs representing pipes, and their vertices representing computation. Within these runtimes a new class of languages such as Sawzall [6] can be used by researchers to solve "pleasantly parallel" problems (problems where the individual elements of datasets are considered to be independent of any other element) more quickly without worrying about explicit concurrency.

These languages provide automated control flow (typically matched to the architecture of the underlying runtime) and communication channels between systems. In existing systems, these workflows and the underlying computation are tightly linked, tying solutions to a particular runtime, workflow and language. This creates difficulties for researchers who wish to draw upon tools written in many different languages or runtimes which may be available on several different architectures or operating systems.

We observe that UNIX pipes [4] were designed to get around many of these incompatibilities, allowing developers to hook together tools written in different languages and runtimes in ad-hoc fashions. This allowed tool developers to focus on doing one thing well, and enabled code portability and reuse in ways not originally conceived by the tool authors.
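As a rough illustration of the pipe model described above, the following Python sketch connects independent programs through operating system pipes so that each stage reads the previous stage's output. It is not part of PUSH: the helper name run_pipeline is hypothetical, and the sketch assumes a POSIX system with sort and uniq on the PATH and an input small enough that no stage blocks on a full pipe buffer.

```python
import subprocess


def run_pipeline(commands, data):
    """Run `commands` (a list of argv lists) as a one-to-one pipeline.

    `data` (bytes) is written to the first stage's standard input and the
    last stage's standard output is returned as bytes. Intended for small
    inputs only: the input is written in full before any output is read.
    """
    procs = []
    prev_stdout = None
    for index, argv in enumerate(commands):
        # The first stage reads from a pipe we write to; every later stage
        # reads directly from the previous stage's standard output.
        stdin = subprocess.PIPE if index == 0 else prev_stdout
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Drop our copy of the read end so only the child holds it.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)

    procs[0].stdin.write(data)
    procs[0].stdin.close()  # signal end of input to the first stage
    output = procs[-1].stdout.read()
    for proc in procs:
        proc.wait()
    return output


if __name__ == "__main__":
    # Neither tool knows about the other; the pipe is the only coupling.
    result = run_pipeline([["sort"], ["uniq", "-c"]], b"b\na\nb\n")
    print(result.decode(), end="")
```

Each stage here has exactly one producer and one consumer, which is the one-to-one shape that ordinary shell pipelines provide.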
The UNIX shell incorporated a model for tersely composing these smaller tools in pipelines (e.g. 'sort | uniq -c'), creating coherent workflows to solve more complicated problems quickly. Tools read from standard input and write to standard output, allowing programs to work together in streams with no explicit knowledge of this chaining built into the program itself.

One-to-one pipelines, such as those used by a typical UNIX shell, cannot be trivially mapped to streaming workflows which incorporate one-to-many, many-to-many, and many-to-one data flows. Additionally, typical UNIX pipeline tools write data according to buffer boundaries instead of record boundaries. As Pike [6] notes, dataflow systems need to be able to cleanly separate input streams into records and then show that the order of these records is independent. By separating input and output into discrete unordered records, data can be easily distributed and coalesced.

Copyright is held by the author/owner(s).
Eurosys April 14–16, 2010, Paris, France.
ACM X-XXXXX-XX-X/XX/XX.

2. PUSH, A DATAFLOW SHELL

To address these issues we have implemented a prototype shell, which we call PUSH, using dataflow principles and incorporating extended pipeline operators to establish distributed workflows (potentially running on clusters of machines) and correlate results. The PUSH shell was developed as part of the HARE project [8], with the intent of using PUSH to deploy applications to millions of nodes, including large scale clusters such as a Blue Gene running the kittyhawk infrastructure, local distributed clusters, and dynamic clusters built using Amazon's EC2 cloud. We are in the process of evaluating and optimizing performance for a variety of application types.

We are also implementing a new version of PUSH, this one based on the RC shell [2] for easier integration into traditional UNIX systems like Linux. This version is simplified and closer to the Bourne shell.
The explicit goal of the new version of PUSH is integration with the Unified Execution Model (UEM) [7], which will allow the transparent distribution of processes and the connection of their communication channels between machines.

[Figure 1: The structure of a typical PUSH shell pipeline. Diagram labels: a shell command connected by a pipe to a multiplexor; pipes from the multiplexor to several parallel shell commands; pipes from those commands to a demultiplexor; a pipe from the demultiplexor to a final shell command.]

3. DESIGN AND IMPLEMENTATION

We have added two additional pipeline operators, a multiplexing fan-out operator (|<[n]), and a coalescing fan-in operator (>|). This combination allows PUSH to distribute I/O to and from multiple simultaneous threads of control. The fan-out argument n specifies the desired degree of parallel threading. If no argument is specified, the default of