Radish: Compiling Efficient Query Plans for Distributed Shared Memory

Brandon Myers (bdmyers@cs.washington.edu), Daniel Halperin (dhalperi@cs), Jacob Nelson (nelsonje@cs), Mark Oskin (oskin@cs), Luis Ceze (luisceze@cs), Bill Howe (billhowe@cs)
University of Washington

ABSTRACT

We present Radish, a query compiler that generates distributed programs. Recent efforts have shown that compiling queries to machine code for a single core can remove iterator and control overhead for significant performance gains. So far, systems that generate distributed programs only compile plans for single processors and stitch them together with messaging. In this paper, we describe an approach for translating query plans into distributed programs by targeting the partitioned global address space (PGAS) parallel programming model as an intermediate representation. This approach affords a natural adaptation of pipelining techniques used in single-core query compilers and an overall simpler design. We adapt pipelined algorithms to PGAS languages, describe efficient data structures for PGAS query execution, and implement techniques for mitigating the overhead of handling a multitude of fine-grained tasks. We evaluate Radish on graph benchmark and application workloads and find that it is 4× to 100× faster than Shark, a recent distributed query engine optimized for in-memory execution. Our work makes important first steps toward ensuring that query processing systems can benefit from future advances in parallel programming and co-mingle with state-of-the-art parallel programs.

1 Introduction

The state of the art for query execution on distributed clusters involves in-memory processing, exemplified by systems such as Spark [38], which have demonstrated orders-of-magnitude performance improvements over earlier disk-oriented systems such as Hadoop [2] and Dryad [21].
These systems still incur significant overhead in serialization, iterators, and inter-process communication, suggesting an opportunity for improvement. Prior systems have demonstrated orders-of-magnitude performance improvements over iterator-based query processing by compiling plans for single processors [10, 23, 27, 34]. Frameworks that generate distributed programs only compile plans for individual processors and stitch them together with communication calls, retaining the iterator model [15, 33]. These systems have a common shortcoming: they depend directly on a single-node compiler (e.g., LLVM, JVM) to perform machine-level optimizations, but these compilers cannot reason about a distributed program.

The alternative, which we explore in this paper, is to generate programs for a partitioned global address space (PGAS) language (e.g., Chapel [11] or X10 [13]), then compile and execute these distributed programs to evaluate the query. A key feature of these languages is a partition: a region of memory local to a particular processor and remote from all others. Consider this query with a join and a multiplication:

    SELECT R.a*R.b, R.b, S.b FROM R, S WHERE R.b = S.a;

One side of the join corresponds to the following PGAS program:

    for r in R:
      on partition [ hash(r.b) ]:
        for s in lookup(r.b):
          emit r.a*r.b, r.b, s.b

This program is a representation of the physical plan: the join is a hash join, with S as the build relation and R as the probe relation. From this code, the PGAS compiler is free to explore an additional class of decisions related to distributed execution on the target machine. The explicit on partition construct instructs the compiler to send the iteration to the worker corresponding to the hash of the value r.b. The multiplication r.a*r.b could be computed either before or after transmission of tuple r over the network to worker hash(r.b).
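The PGAS pseudocode above is not directly executable. As a minimal sketch of the same partitioned hash join, the following Python models partitions as in-process dicts; the names NPART, partition_of, build, and probe are illustrative, not from Radish, and the `on partition` hop is simulated by indexing into the owning partition's table.

```python
# Sketch: simulate the partitioned hash join from the PGAS pseudocode.
# Partitions are modeled as a list of dicts; hash(x) % NPART picks the
# partition that "owns" a key, standing in for `on partition [hash(...)]`.
NPART = 4

def partition_of(key):
    return hash(key) % NPART

# Build phase: hash S on its join key S.a into per-partition tables.
def build(S):
    tables = [dict() for _ in range(NPART)]
    for (s_a, s_b) in S:
        tables[partition_of(s_a)].setdefault(s_a, []).append((s_a, s_b))
    return tables

# Probe phase: each tuple of R is "sent" to the partition owning hash(r.b),
# where it probes the local table and emits joined output tuples.
def probe(R, tables):
    out = []
    for (r_a, r_b) in R:
        local = tables[partition_of(r_b)]      # the `on partition` hop
        for (_s_a, s_b) in local.get(r_b, []):
            out.append((r_a * r_b, r_b, s_b))  # emit r.a*r.b, r.b, s.b
    return out

R = [(2, 10), (3, 20)]
S = [(10, 7), (20, 8), (30, 9)]
print(sorted(probe(R, build(S))))  # [(20, 10, 7), (60, 20, 8)]
```

In a real PGAS runtime the probe iteration itself migrates to the remote partition rather than indexing a shared list, but the dataflow is the same.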
In terms of communication, there is no obvious difference between these two choices; in either case, two numbers are sent over the network: (r.a*r.b, r.b) in one case and (r.a, r.b) in the other. However, a compiler that understands this parallel code and the underlying architecture can consider the likelihood that the multiply functional unit is available. This kind of optimization is inaccessible both to a database-style algebraic optimizer that cannot reason about the instruction level and to an ordinary shared-memory compiler (e.g., LLVM, GCC) that cannot

Technical Report UW-CSE-14-10-01
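To make the equal-communication point concrete, here is a tiny Python sketch of the two message layouts; message_eager and message_lazy are hypothetical names for the two placements of the multiply, not functions from Radish.

```python
# Both placements of the multiply ship exactly two numbers per R-tuple;
# the choice only affects *where* the multiply executes.

def message_eager(r_a, r_b):
    # Multiply before transmission: send (r.a*r.b, r.b).
    return (r_a * r_b, r_b)

def message_lazy(r_a, r_b):
    # Send (r.a, r.b); the destination partition computes r.a*r.b.
    return (r_a, r_b)

# Either way the payload is two values wide.
assert len(message_eager(3, 5)) == len(message_lazy(3, 5)) == 2
```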