A Flexible Simulator of Pipelined Processors

Ben Juurlink, Koen Bertels, Bei Li
Computer Engineering Laboratory
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands
Phone: +31 15 27 81572, fax: +31 15 27 84898
benj@ce.et.tudelft.nl

Abstract

A flexible, parameterizable simulator of pipelined processors is presented. The simulator allows several (micro-)architectural features to be configured, such as the pipeline depth, the stage in which branch execution occurs, whether or not register file forwarding is performed, and the number of branch delay slots. We use the simulator to perform experiments with three synthetic benchmarks: vector addition, vector summation, and sum of absolute differences. These kernels are representative of data-parallel loops, reduction operations, and benchmarks containing many hard-to-predict branches, respectively.

Keywords: simulation, pipelined processors, microarchitecture, embedded processors, energy reduction.

1 Introduction

Since superscalar processors are very power hungry, the core of many embedded systems is an in-order issue, pipelined processor. This is exemplified by the ARM processors: the ARM7 implements a 3-stage pipeline, the ARM10 has 5 stages, and the ARM11 has 8 stages. The optimal number of pipeline stages usually depends on the application. If the application performs many independent operations, a deep pipeline is preferable and no forwarding datapaths are needed. If operations are dependent, a shorter pipeline is preferable and forwarding may be required to avoid stalls. Furthermore, if energy consumption is a concern, a deep pipeline is favorable because a deeper pipeline often permits a lower supply voltage and, hence, reduced energy consumption. To investigate these trade-offs, a simulator is needed in which the pipeline depth, the forwarding datapaths, etc. can be configured.
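The trade-off between pipeline depth and dependences described above can be made concrete with a toy cycle-count model. The following Python sketch is not the simulator presented in this paper; it is a minimal illustration under stated assumptions. The function names, the 4000 ps total logic delay, the 100 ps latch overhead, and the stall counts per dependence are all hypothetical values chosen for the example.

```python
def total_cycles(n_instr, depth, n_dependent=0, stall_per_dep=0):
    """Cycle count of an in-order pipeline in a simple textbook-style model.

    The first instruction needs `depth` cycles to pass through the pipeline,
    each further instruction completes one cycle later, and every dependent
    instruction adds `stall_per_dep` stall cycles (an assumed parameter).
    """
    if n_instr == 0:
        return 0
    return depth + (n_instr - 1) + n_dependent * stall_per_dep


def cycle_time_ps(depth, logic_delay_ps=4000, latch_overhead_ps=100):
    """Assumed cycle time: logic split evenly over the stages plus latch overhead."""
    return logic_delay_ps / depth + latch_overhead_ps


def execution_time_ps(n_instr, depth, n_dependent=0, stall_per_dep=0):
    """Execution time in picoseconds: cycle count times the assumed cycle time."""
    cycles = total_cycles(n_instr, depth, n_dependent, stall_per_dep)
    return cycles * cycle_time_ps(depth)


if __name__ == "__main__":
    n = 1000
    # Independent operations: no stalls in either configuration.
    print(execution_time_ps(n, depth=5))
    print(execution_time_ps(n, depth=8))
    # Dependent chain: the 5-stage pipeline is assumed to forward (0 stalls),
    # the 8-stage pipeline is assumed to stall 3 cycles per dependence.
    print(execution_time_ps(n, depth=5, n_dependent=n - 1, stall_per_dep=0))
    print(execution_time_ps(n, depth=8, n_dependent=n - 1, stall_per_dep=3))
```

With these assumed numbers, the deeper pipeline is faster for independent operations (604200 ps versus 903600 ps), whereas for a chain of dependent operations the shallower pipeline with forwarding is faster (903600 ps versus 2402400 ps).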
To our knowledge, however, no existing simulator allows the microarchitecture to be configured. For example, in [7] the authors propose to pipeline cache accesses in order to reduce the cache supply voltage and, thereby, energy consumption, but to evaluate their proposal they had to modify the MARS simulator [2]. In this paper we present such a flexible, configurable simulator. As is typical for load/store RISC architectures, it is assumed that the execution of every instruction passes through the following steps: Instruction Fetch, Instruction Decode, Execute, Memory Access, and Write-Back. Each of these steps (or super-stages) may be split into an arbitrary number of sub-stages. For example, pipelined caches can be simulated by specifying that the Instruction Fetch and Memory Access super-stages consist of