Optimizing Assignment of Threads to SPEs on the Cell BE Processor

C.D. Sudheer, T. Nagaraju, and P.K. Baruah
Dept. of Mathematics and Computer Science
Sri Sathya Sai University
Prashanthi Nilayam, India
sudheer@sssu.edu.in

Ashok Srinivasan
Dept. of Computer Science
Florida State University
Tallahassee, USA
asriniva@cs.fsu.edu

Abstract

The Cell is a heterogeneous multicore processor that has attracted much attention in the HPC community. The bulk of the computational workload on the Cell processor is carried by eight co-processors called SPEs. The SPEs are connected to each other and to main memory by a high speed bus called the Element Interconnect Bus (EIB), which is capable of 204.8 GB/s. However, access to the main memory is limited by the performance of the Memory Interface Controller (MIC) to 25.6 GB/s. It is, therefore, advantageous for algorithms to be structured such that SPEs communicate directly between themselves over the EIB, and make less use of memory. We show that the actual bandwidth obtained for inter-SPE communication is strongly influenced by the assignment of threads to SPEs (thread-SPE affinity) in many realistic communication patterns. We identify the bottlenecks to optimal performance and use this information to determine good affinities for common communication patterns. Our solutions improve performance by up to a factor of two over the default assignment. We also discuss the optimization of affinity on a Cell blade consisting of two Cell processors, and provide a software tool to help with this. Our results will help Cell application developers choose good affinities for their applications.

1. Introduction

The Cell is a heterogeneous multi-core processor that has attracted much attention in the HPC community. It contains a PowerPC core, called the PPE, and eight co-processors, called SPEs.
The SPEs are meant to handle the bulk of the computational workload, and have a combined peak speed of 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision. They are connected to each other and to main memory by a high speed bus called the EIB, which has a bandwidth of 204.8 GB/s. However, access to main memory is limited by the Memory Interface Controller's performance to 25.6 GB/s total (both directions combined). If all eight SPEs access main memory simultaneously, then each sustains a bandwidth of less than 4 GB/s. On the other hand, each SPE is capable of simultaneously sending and receiving data at 25.6 GB/s in each direction. Latency for inter-SPE communication is under 100 ns for short messages, while latency to main memory is a factor of two greater. It is, therefore, advantageous for algorithms to be structured such that SPEs tend to communicate more between themselves, and make less use of main memory.

The latency between each pair of SPEs is identical for short messages, and so affinity does not matter in this case. In the absence of contention for the EIB, the bandwidth between each pair of SPEs is identical for long messages too, and reaches the theoretical limit. However, we show later that in the presence of contention, the bandwidth can fall well short of the theoretical limit, even when the EIB's bandwidth is not saturated. This happens when the message size is greater than 16 KB. It is, therefore, important to assign threads to SPEs so as to avoid contention, in order to maximize the bandwidth for the communication pattern of the application.

We first identify the causes of this loss in performance, and use this information to develop good thread-SPE affinity schemes for common communication patterns, such as ring, binomial-tree, and recursive doubling. We show that our schemes can improve performance by over a factor of two relative to a poor choice of assignments.
The default assignment scheme is effectively random, which sometimes leads to poor affinities and sometimes to good ones. With many communication patterns, our schemes yield performance that is close to twice as good as the average performance of the default scheme. Our schemes also lead to more predictable performance, in the sense that the standard deviation of the bandwidth obtained is lower.

We also discuss optimization of affinity on a Cell blade consisting of two Cell processors. We show that the affinity within each processor is often less important than the assignment of threads to processors. We provide a software tool that partitions the threads between the two processors so as to yield good performance.

The outline of the rest of the paper is as follows. In § 2, we summarize the architectural features of the Cell processor relevant to this paper. We next show that thread-SPE affinity can have significant influence on inter-