Control Flow Emulation on Tiled SIMD Architectures

Ghulam Lashari, Ondřej Lhoták, and Michael McCool

D. R. Cheriton School of Computer Science, University of Waterloo

Abstract. Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures. These types of processors are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores but limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid the use of cache coherency protocols and may even omit general-purpose caches, opting for restricted caches or explicitly managed local memory.

We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined, and the caches for reads and writes are not coherent, so reads and writes may not use the same memory locations simultaneously. This requires the use of double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer, and our approach includes specific optimizations that deal with double-buffering.

1 Introduction

GPUs are high-performance processors originally designed for graphics acceleration. However, they are programmable and capable of accelerating a variety of demanding floating-point applications. They can often achieve performance that is more than an order of magnitude faster than corresponding CPU implementations [1]. Application areas for which implementations have been performed include ray tracing, image and signal processing, computational geometry, financial option pricing, sequence alignment, protein folding, database search, and many other problems in scientific computation, including solving differential equations and optimization problems.

These processors are best suited to massively parallel problems, and internally make extensive use of SIMD (single instruction, multiple data) parallelism. They do have multiple cores with separate threads of control, but each core uses SIMD execution. We will refer to such an architecture as a tiled SIMD architecture. Although we will focus on the GPU in this paper, even on the Cell
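To make the multipass idea from the abstract concrete, the following CPU-side sketch illustrates how a data-dependent branch can be emulated by splitting execution into separate passes over double-buffered streams: each pass applies one side of the conditional only to the elements flagged for it, and because reads and writes may not touch the same locations, the input and output buffers are swapped between passes. This is an illustrative sketch only, not the paper's implementation; the names (run_pass, Elem, then_op, else_op, NUM_ELEMS) are hypothetical.

/* Sketch: emulating an if/else with multipass partitioning over
 * double-buffered streams.  Hypothetical names, not from the paper. */
#include <stddef.h>

#define NUM_ELEMS 1024

typedef struct { float value; int branch; } Elem;  /* branch: 0 = then, 1 = else */

static Elem buf_a[NUM_ELEMS], buf_b[NUM_ELEMS];

/* One SIMD-style pass: visit every element, but commit the new value only
 * for elements whose predicate matches this pass; others are copied through.
 * Reads come from 'in' and writes go to 'out', so no location is read and
 * written in the same pass. */
static void run_pass(const Elem *in, Elem *out, size_t n,
                     int which_branch, float (*op)(float)) {
    for (size_t i = 0; i < n; i++) {      /* stands in for the parallel SIMD lanes */
        out[i] = in[i];
        if (in[i].branch == which_branch)
            out[i].value = op(in[i].value);
    }
}

static float then_op(float x) { return x * 2.0f; }  /* body of the 'then' branch */
static float else_op(float x) { return x + 1.0f; }  /* body of the 'else' branch */

void emulate_branch(void) {
    Elem *in = buf_a, *out = buf_b, *tmp;

    /* Pass 1: elements taking the 'then' branch. */
    run_pass(in, out, NUM_ELEMS, 0, then_op);
    tmp = in; in = out; out = tmp;        /* swap the double buffers */

    /* Pass 2: elements taking the 'else' branch; results end up in 'out'. */
    run_pass(in, out, NUM_ELEMS, 1, else_op);
}

On a GPU, each call to run_pass would correspond to one rendering or compute pass over the stream, with the predicate test folded into the kernel; the buffer swap is what the double-buffered streaming in our approach manages.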