HAsim: FPGA-Based High-Detail Multicore Simulation Using Time-Division Multiplexing

Michael Pellauer*, Michael Adler, Michel Kinsy*, Angshuman Parashar, Joel Emer*†

* Computation Structures Group, Computer Science and A.I. Lab, Massachusetts Institute of Technology
{pellauer, mkinsy, emer}@csail.mit.edu

† VSSAD Group, Intel Corporation
{michael.adler, angshuman.parashar, joel.emer}@intel.com

Abstract—In this paper we present the HAsim FPGA-accelerated simulator. HAsim is able to model a shared-memory multicore system, including detailed core pipelines, a cache hierarchy, and an on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and on-chip network. We compare our time-multiplexed approach to a direct implementation, and present a case study that motivates why high-detail simulations should continue to play a role in the architectural exploration process.

Index Terms—Simulation, Modeling, On-Chip Networks, Field-Programmable Gate Arrays, FPGA

I. INTRODUCTION

Gaining micro-architectural insight relies on the architect's ability to simulate the target system with a high degree of accuracy. Unfortunately, accuracy comes at the cost of simulator performance—the simulator must emulate more detailed hardware structures on every cycle, so simulated cycles-per-second decrease. Naturally, there is a temptation to reduce the detail of the model in order to facilitate efficient simulation. Typical simulator abstractions include ignoring wrong-path instructions or replacing core pipelines with abstract models. While such low-fidelity models can help greatly with initial pathfinding, the best way for computer architects to convince skeptical colleagues remains a cycle-by-cycle simulation of a realistic core pipeline, cache hierarchy, and on-chip network (OCN).
While parallelizing the simulator can recover some performance, parallel simulators have found their performance limited by communication between the cores on the OCN, and have been forced to reduce OCN fidelity in order to achieve reasonable parallelism [1], [2], [3]. In this paper we advocate an alternative approach: hosting the simulator on a reconfigurable logic platform. This is facilitated by an emerging class of products that allow a Field Programmable Gate Array (FPGA) to be added to a general-purpose computer via a fast link such as PCIe [4], HyperTransport [5], or the Intel Front-Side Bus [6]. On an FPGA, adding detail to a model does not necessarily degrade performance. For example, adding a complex reorder buffer (ROB) to an existing core uses more of the FPGA's resources, but the ROB and the rest of the core will be simulated simultaneously during a single tick of the FPGA's clock. Similarly, communication within an FPGA is fast, so there is great incentive to fit interacting structures such as cores, caches, and OCN routers onto the same FPGA.

In this paper we present HAsim, a novel FPGA-accelerated simulator that can simulate a multicore with a high-detail pipeline, cache hierarchy, and detailed on-chip network using a single FPGA. HAsim accomplishes this via several contributions to efficient scaling that are detailed in this paper. First, we present a fine-grained time-multiplexing scheme that allows a single physical pipeline to act as a detailed timing model for a multicore. Second, we extend the fine-grained multiplexing scheme to the on-chip network via a novel use of permutations. We generalize our technique to any possible OCN topology, including heterogeneous networks. We compare HAsim's time-multiplexing approach to a direct implementation on an FPGA.
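The core idea behind fine-grained time-multiplexing can be illustrated in software. The following is a minimal sketch, not HAsim's actual implementation: a single physical copy of a pipeline stage's logic services N virtual core instances in round-robin order, with per-instance state held in an array (which on an FPGA would map to block RAM). The class name, state encoding, and round-robin policy here are illustrative assumptions.

```python
class MultiplexedStage:
    """One physical stage simulating N virtual cores by time-multiplexing.

    Illustrative sketch: the stage logic (`compute`) exists once, while
    the per-virtual-core state is replicated N times, mirroring how a
    multiplexed FPGA pipeline trades logic area for state storage.
    """

    def __init__(self, num_cores, compute):
        self.num_cores = num_cores
        self.state = [0] * num_cores   # per-virtual-core state (block RAM analog)
        self.compute = compute         # single physical copy of the stage logic
        self.current = 0               # round-robin instance pointer

    def tick(self, inputs):
        """One physical (FPGA) cycle: advance one virtual core's model cycle."""
        core = self.current
        self.state[core] = self.compute(self.state[core], inputs[core])
        self.current = (core + 1) % self.num_cores
        return core


if __name__ == "__main__":
    # Four virtual cores share one physical stage; eight physical ticks
    # advance each virtual core by two model cycles.
    stage = MultiplexedStage(4, lambda s, x: s + x)
    for _ in range(8):
        stage.tick([1, 2, 3, 4])
    print(stage.state)  # [2, 4, 6, 8]
```

Note the trade-off this sketch exposes: simulating N cores costs roughly N physical cycles per model cycle, but requires only one copy of the stage logic, which is what lets a detailed multicore model fit on a single FPGA.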
Finally, we use HAsim to study the degree to which realism in the core model can affect OCN simulation results in a shared-memory multicore, an argument for the continued value of high-detail simulation in the architectural exploration process.

This paper considers only a single FPGA accelerator. A complementary technique for scaling simulations is to partition the model across multiple FPGAs. However, we do not consider this a limitation: to maximize the capacity of the multi-FPGA scenario, we must first maximize utilization of an individual FPGA.