Automatic C-to-CUDA Code Generation for Affine Programs

Muthu Manikandan Baskaran (1), J. Ramanujam (2), and P. Sadayappan (1)
(1) The Ohio State University, USA
(2) Louisiana State University, USA

Abstract. Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high-performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and multi-level parallel view make manual development of high-performance CUDA code rather complicated. Hence the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from input sequential C code, for regular (affine) programs. Using and adapting publicly available tools that have made polyhedral compiler optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA code optimized for efficient data access. The performance of the automatically generated code is compared with manually optimized CUDA code for a number of benchmarks. The automatically generated CUDA code performs quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.

1 Introduction

Graphics Processing Units (GPUs) represent the most powerful multi-core systems currently in use. For example, the NVIDIA GeForce 8800 GTX GPU chip has a peak performance of over 350 GFLOPS, and the NVIDIA GeForce GTX 280 chip has a peak performance of over 900 GFLOPS. There has been considerable recent interest in using GPUs for general-purpose computing [8,13,12].
Until recently, general-purpose computations on GPUs were performed by transforming matrix operations into specialized graphics processing, such as texture operations. The introduction of the CUDA (Compute Unified Device Architecture) programming model by NVIDIA provided a general-purpose multi-threaded model for implementing general-purpose computations on GPUs. Although more convenient than previous graphics programming APIs for developing GPGPU codes, the manual development of high-performance codes with the CUDA model is still much more complicated than the use of parallel programming models such as OpenMP for general-purpose multi-core systems. It is therefore of great interest, for enhanced programmer productivity and for software quality, to develop compiler support to facilitate the automatic transformation of sequential input programs into efficient parallel CUDA programs.

R. Gupta (Ed.): CC 2010, LNCS 6011, pp. 244-263, 2010. © Springer-Verlag Berlin Heidelberg 2010