Journal of Instruction-Level Parallelism 1 (2006) 1-15 Submitted 2/06; published

Benefits of Register-Level Lookup for a CELL SPU Math Library

Christopher Kumar Anand anandc@mcmaster.ca
Wei Li
Anuroop Sharma
Sanvesh Srivastava
McMaster University, Computing and Software, ITB-201
1280 Main Street West, Hamilton, ON L8S 4K1 Canada

Abstract

This paper demonstrates techniques for increasing instruction-level parallelism via register-level lookups as an alternative to predication, using examples from a library of elementary math functions targeting CELL SPUs. Comparing the performance of our library with functions released in the CELL SDK 1.1, which do not use register-level lookup to improve parallelism, we measure a performance advantage of 40 percent on average. Since some functions are too simple to benefit, the average masks much larger improvements for the more complicated functions. We developed this library using a declarative assembly language and a SPU simulator written in Haskell. Through the examples, we show how this environment supports the rapid prototyping of compiler optimizations such as register-level lookups and the incorporation of code generation from mathematical models. The example code is documented in detail and should serve as a good introduction to these special features of the SPU instruction set architecture for both compiler writers and software developers.

1. Introduction

In this paper, we summarize our experience developing basic mathematical subroutines for the novel Synergistic Processing Units (SPUs) within the first implementation of the CELL broadband engine. For many applications heavy in single-precision floating-point arithmetic, the first CELL implementation, with a throughput of 64 flops per cycle, promises significant advances in performance per watt and, we expect, in performance per dollar. But getting there will require an investment in better tools and practices.
We hope eventually to share our tools and immediately to inspire other tool builders, but we also hope to influence best programming practices by including non-trivial examples with detailed explanations. We found that for elementary function calculation, register-level lookup-table implementations using byte permutation, which take advantage of the zero cost of mixing integer, logical, and floating-point operations (a consequence of the unified register file), produce significant performance improvements over a more portable SIMD implementation.

We have implemented all of the single-precision functions provided by IBM's Math Acceleration SubSystem (MASS) library using the techniques that we advocate. Performance comparison with the functions available in the Sony-Toshiba-IBM-supplied SDK [1] and pre-release hyperbolic functions shows that these techniques improve throughput by anywhere from a few percent to a factor of ten, with lookup-based implementations being twice as fast on average for routines involving polynomial approximations (which excludes the short routines based on hardware interpolation, such as square root and divide). These are significant gains given that the existing SDK sample code is already branch-free and coded to take advantage of the basic 4X SIMD speedup.

Since we had originally planned to target VMX/Altivec with this work, we were curious about the potential for additional parallel execution if the simple integer and logical instructions were executable in parallel with floating-point operations, as is the case in some VMX/Altivec implementations. Fig. 4 shows quite a favorable instruction mix.