Control Flow Emulation on Tiled SIMD Architectures

Ghulam Lashari, Ondřej Lhoták, and Michael McCool

D. R. Cheriton School of Computer Science, University of Waterloo

Abstract. Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures. These types of processors are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores but limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid the use of cache coherency protocols and may even omit general-purpose caches, opting for restricted caches or explicitly managed local memory.

We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined, and the caches for reads and writes are not coherent, so reads and writes may not use the same memory locations simultaneously. This requires the use of double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer, and our approach includes specific optimizations that deal with double-buffering.

1 Introduction

GPUs are high-performance processors originally designed for graphics acceleration. However, they are programmable and capable of accelerating a variety of demanding floating-point applications. They can often achieve performance that is more than an order of magnitude faster than corresponding CPU implementations [1]. Application areas for which implementations have been performed include ray tracing, image and signal processing, computational geometry, financial option pricing, sequence alignment, protein folding, database search, and many other problems in scientific computation, including solving differential equations and optimization problems.

These processors are best suited to massively parallel problems, and internally make extensive use of SIMD (single instruction, multiple data) parallelism. They do have multiple cores with separate threads of control, but each core uses SIMD execution. We will refer to such an architecture as a tiled SIMD architecture. Although we will focus on the GPU in this paper, even on the Cell
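To make the multipass idea from the abstract concrete, the following CPU-side sketch illustrates how a data-dependent branch can be emulated by splitting execution into separate passes over double-buffered streams: each pass applies one side of the conditional only to the elements flagged for it, and because reads and writes may not touch the same locations, the input and output buffers are swapped between passes. This is an illustrative sketch only, not the paper's implementation; the names (run_pass, Elem, then_op, else_op, NUM_ELEMS) are hypothetical.

/* Sketch: emulating an if/else with multipass partitioning over
 * double-buffered streams.  Hypothetical names, not from the paper. */
#include <stddef.h>

#define NUM_ELEMS 1024

typedef struct { float value; int branch; } Elem;  /* branch: 0 = then, 1 = else */

static Elem buf_a[NUM_ELEMS], buf_b[NUM_ELEMS];

/* One SIMD-style pass: visit every element, but commit the new value only
 * for elements whose predicate matches this pass; others are copied through.
 * Reads come from 'in' and writes go to 'out', so no location is read and
 * written in the same pass. */
static void run_pass(const Elem *in, Elem *out, size_t n,
                     int which_branch, float (*op)(float)) {
    for (size_t i = 0; i < n; i++) {      /* stands in for the parallel SIMD lanes */
        out[i] = in[i];
        if (in[i].branch == which_branch)
            out[i].value = op(in[i].value);
    }
}

static float then_op(float x) { return x * 2.0f; }  /* body of the 'then' branch */
static float else_op(float x) { return x + 1.0f; }  /* body of the 'else' branch */

void emulate_branch(void) {
    Elem *in = buf_a, *out = buf_b, *tmp;

    /* Pass 1: elements taking the 'then' branch. */
    run_pass(in, out, NUM_ELEMS, 0, then_op);
    tmp = in; in = out; out = tmp;        /* swap the double buffers */

    /* Pass 2: elements taking the 'else' branch; results end up in 'out'. */
    run_pass(in, out, NUM_ELEMS, 1, else_op);
}

On a GPU, each call to run_pass would correspond to one rendering or compute pass over the stream, with the predicate test folded into the kernel; the buffer swap is what the double-buffered streaming in our approach manages.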