Journal of Instruction-Level Parallelism 1 (2006) 1-15 Submitted 2/06; published

Benefits of Register-Level Lookup for a CELL SPU Math Library

Christopher Kumar Anand anandc@mcmaster.ca
Wei Li
Anuroop Sharma
Sanvesh Srivastava
McMaster University, Computing and Software, ITB-201
1280 Main Street West, Hamilton, ON L8S 4K1 Canada

Abstract

This paper demonstrates techniques for increasing instruction-level parallelism via register-level lookups as an alternative to predication, using examples from a library of elementary math functions targeting CELL SPUs. Comparing the performance of our library with functions released in the CELL SDK 1.1, which do not use register-level lookup to improve parallelism, we measure a performance advantage of 40 percent on average. Since some functions are too simple to benefit, the average masks much larger improvements for the more complicated functions. We developed this library using a declarative assembly language and a SPU simulator written in Haskell. Through the examples, we show how this environment supports the rapid prototyping of compiler optimizations such as register-level lookups and the incorporation of code generation from mathematical models. The example code is documented in detail and should serve as a good introduction to these special features of the SPU instruction set architecture for both compiler writers and software developers.

1. Introduction

In this paper, we summarize our experience developing basic mathematical subroutines for the novel Synergistic Processing Units (SPUs) within the first implementation of the CELL broadband engine. For many applications heavy in single-precision floating-point arithmetic, the first CELL implementation, with a throughput of 64 flops per cycle, promises significant advances in performance per watt and, we expect, in performance per dollar. But getting there will require an investment in better tools and practices.
We hope eventually to share our tools and immediately to inspire other tool builders, but we also hope to influence best programming practices by including non-trivial examples with detailed explanations. We found that for elementary function calculation, register-level lookup-table implementations using byte permutation, which take advantage of the zero cost of mixing integer, logical, and floating-point operations (a consequence of the unified register file), produce significant performance improvements over a more portable SIMD implementation.

We have implemented all of the single-precision functions provided by IBM's Math Acceleration SubSystem (MASS) library using the techniques that we advocate. Performance comparison with the functions available in the Sony-Toshiba-IBM-supplied SDK [1] and pre-release hyperbolic functions shows that these techniques improve throughput by anywhere from a few percent to a factor of ten, with lookup-based implementations being twice as fast on average for routines involving polynomial approximations (which excludes the short routines based on hardware interpolation, such as square root and divide). These are significant gains given that the existing SDK sample code is already branch-free and coded to take advantage of the basic 4X SIMD speedup.

Since we had originally planned to target VMX/Altivec with this work, we were curious about the potential for additional parallel execution if the simple integer and logical instructions were executable in parallel with floating-point operations, as is the case in some VMX/Altivec implementations. Fig. 4 shows quite a favorable instruction mix.