Scalability Evaluation of a Polymorphic Register File: a CG Case Study

Cătălin B. Ciobanu 1, Xavier Martorell 2,3, Georgi K. Kuzmanov 1, Alex Ramirez 2,3, and Georgi N. Gaydadjiev 1

1 Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, The Netherlands
{c.b.ciobanu, g.k.kuzmanov, g.n.gaydadjiev}@tudelft.nl
2 Universitat Politècnica de Catalunya, Spain
3 Barcelona Supercomputing Center
{xavier.martorell, alex.ramirez}@bsc.es

Abstract. We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when an equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the impact of memory latency and cache bandwidth on the sustainable speedups of the system considered. Our tests suggest that a 128 worker configuration requires the caches to deliver 1638.4 GB/sec in order to preserve 80% of its peak speedup.

1 Introduction

Recent generations of processor designs have reached a point where just increasing the clock frequency in order to gain performance is no longer feasible because of power and thermal constraints.
As more transistors become available in each generation of CMOS technology, designers have followed two trends in order to improve performance: the specialization of cores, targeting improved performance in certain classes of applications, and the use of Chip Multi-Processor (CMP) designs in order to extract more performance from multi-threaded applications. Examples of specialized extensions include Single Instruction Multiple Data (SIMD) extensions such as Altivec [9], which are designed to exploit the available Data Level Parallelism, as well as hardware support for the Advanced Encryption Standard [8], which provides improved performance for data encryption. A typical example of a heterogeneous CMP architecture is the Cell