Parakeet: A Just-In-Time Parallel Accelerator for Python

Alex Rubinsteyn, Eric Hielscher, Nathaniel Weinman, Dennis Shasha
Computer Science Department, New York University, New York, NY, 10003
{alexr,hielscher,nsw233,shasha}@cs.nyu.edu

Abstract

High level productivity languages such as Python or Matlab enable the use of computational resources by non-expert programmers. However, these languages often sacrifice program speed for ease of use.

This paper proposes Parakeet, a library which provides a just-in-time (JIT) parallel accelerator for Python. Parakeet bridges the gap between the usability of Python and the speed of code written in efficiency languages such as C++ or CUDA. Parakeet accelerates data-parallel sections of Python that use the standard NumPy scientific computing library. Parakeet JIT compiles efficient versions of Python functions and automatically manages their execution on both GPUs and CPUs. We assess Parakeet on a pair of benchmarks and achieve significant speedups.

1 Introduction

Numerical computing is an indispensable tool to professionals in a wide range of fields, from the natural sciences to the financial industry. Often, users in these fields either (1) aren't expert programmers; or (2) don't have time to tune their software for performance. These users typically prefer to use productivity languages such as Python or Matlab rather than efficiency languages such as C++. Productivity languages facilitate non-expert programmers by trading off program speed for ease of use [23].

One problem, however, is that the performance tradeoff is often very stark: code written in Python or Matlab [19] often has much worse performance than code written in C++ or Fortran. This problem is getting worse, as modern processors (multicore CPUs as well as GPUs) are all parallel, and current implementations of productivity languages are poorly suited for parallelism.
Thus a common workflow involves prototyping algorithms in a productivity language, followed by porting the performance-critical sections to a lower level language. This second step can be time-consuming and error-prone, and it diverts energy from the real focus of these users' work.

In this paper, we present Parakeet, a library that provides a JIT parallel accelerator for NumPy, the commonly-used scientific computing library for Python [22]. Parakeet accelerates performance-critical sections of numerical Python programs to be competitive with efficiency language code, obviating the need for the above-mentioned "prototype, port" cycle.

The Parakeet library intercepts programmer-marked functions and uses high-level operations on NumPy arrays (e.g. mapping a function over the array's elements) as sources of parallelism. These functions are just-in-time compiled to either x86 machine code using LLVM [17] or GPU code that can be executed on NVIDIA GPUs via the CUDA framework [20]. These native versions of the functions are then automatically executed on the appropriate hardware. Parakeet allows complete interoperability with all of the standard Python tools and libraries.

Parakeet currently supports JIT compilation to parallel GPU programs and single-threaded CPU programs. While Parakeet is a work in progress, our current results clearly demonstrate its promise.

2 Overview

Parakeet is an accelerator library for numerical Python algorithms written using the NumPy array extensions [22]. Parakeet does not replace the standard Python runtime but rather selectively augments it. To run a function within Parakeet, a user must wrap it with the decorator @PAR. For example, consider the following NumPy code for averaging the values of two arrays:

    @PAR
    def avg(x,y):
        return (x+y) / 2.0

If the decorator @PAR were removed, then avg would run as ordinary Python code.
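To make the undecorated case concrete, the following is a minimal sketch of avg run as plain NumPy code, without Parakeet. The sample arrays x and y are illustrative assumptions, not inputs from the benchmarks; the comments note where NumPy creates arrays during evaluation.

```python
import numpy as np

def avg(x, y):
    # Plain NumPy: (x + y) is evaluated first and produces a temporary array,
    # then the division by 2.0 produces a second, separate result array.
    return (x + y) / 2.0

# Illustrative inputs (assumed for this sketch).
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 4.0, 5.0])

result = avg(x, y)
print(result)
```

Here the function is an ordinary Python callable, so it can be used anywhere NumPy code is accepted; the only difference from the decorated version is the absence of @PAR.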
Since NumPy's library functions are compiled separately, they always allocate result arrays (even when the arrays are immediately consumed). By contrast, Parakeet specializes avg for any distinct input