The Path Forward: Specialized Computing in the Datacenter

Nikos Hardavellas*, Michael Ferdman†‡, Anastasia Ailamaki‡, Babak Falsafi‡

*Department of Electrical Engineering and Computer Science, Northwestern University
†Department of Electrical and Computer Engineering, Carnegie Mellon University
‡School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne

{nikos@northwestern.edu, mferdman@ece.cmu.edu, anastasia.ailamaki@epfl.ch, babak.falsafi@epfl.ch}

ABSTRACT

Popular belief holds that the cores on chip will grow at an exponential rate, following Moore’s Law, with a commensurate increase in performance. However, by exploring the design space of multicore chips across technologies under a large array of design parameters, we observe that physical constraints in power and off-chip bandwidth prohibit such performance increase. This leads us to conclude that server chips will not scale beyond a few tens of cores, potentially leaving the die real-estate underutilized in future technology generations. We observe that heterogeneous multicores can leverage the die area to overcome the initial power barrier, delivering significantly higher performance for the same off-chip bandwidth and power envelopes. Thus, specialized computing, especially when coupled with emerging memory technologies, promises significant increases in performance and energy-efficiency compared to general-purpose computing in the datacenter.

INTRODUCTION

As Moore’s Law continues for at least another decade, the number of cores on chip will continue to grow at an exponential rate. While workloads with limited parallelism pose performance challenges with chip multiprocessors (CMPs), server workloads with abundant parallelism are believed to be immune, capable of scaling to the parallelism available in the hardware.
However, contrary to popular belief, despite the inherent scalability in threaded server workloads, increasing core counts cannot directly translate into performance improvements because chips are physically constrained in power and off-chip bandwidth.

Multicores are not a panacea for server processor designs. While Moore’s Law enables more transistors on chip [4], the static power consumption of the additional transistors can no longer be mitigated through circuit-level techniques [1]. Although a trade-off exists between cache performance and leakage power, the cache latency cannot be sufficiently reduced to deliver reasonable performance and simultaneously limit the leakage power. Additionally, the multiplying core counts and thread contexts constitute a substantial fraction of the chip’s transistors, steadily raising both static and dynamic core power consumption. While voltage-frequency scaling may lower the dynamic power of the cores and enable more cores on chip, static power dissipation and performance requirements impose a limit. Thus, despite the abundant parallelism present in server workloads, server multicore designs are rapidly approaching the power wall.

Considering a large array of design parameters, we construct detailed models which conform to ITRS projections of future manufacturing technologies. We jointly optimize supply and threshold voltage, clock frequency, core count, manufacturing process, cache size, and memory technology to conclude that, without a technological miracle, server CMPs will not scale beyond a few tens of cores due to physical power and off-chip bandwidth constraints, leaving the die real-estate underutilized. We observe that heterogeneous multicores, by reducing energy waste through specialization, can leverage the die area to overcome the initial power barrier, delivering significantly higher performance under the same physical constraints.
Thus, specialized computing shows promise in improving the aggregate performance and energy efficiency of the datacenter. This is especially true when heterogeneous CMPs are coupled with emerging memory technologies, which mitigate the bandwidth wall and fully expose the CMP to the power wall.

METHODOLOGY

Complexity and run-time requirements make it impractical to rely on full-system simulation for a large-scale design-space exploration study. Instead, we rely on first-order analytical models of the dominant components, with parameters tuned through full-system simulation. Our algorithm uses the analytical models as constraints, always finding the core count and cache size of the peak-performing design.

We model CMPs across four fabrication technologies: 65nm, 45nm, 32nm (due in 2013) and 20nm (due in 2017). For each technology node, we utilize parameters and projections from the International Technology Roadmap for Semiconductors (ITRS) 2008 Edition [4]. In agreement with ITRS, we model bulk planar CMOS for the 65nm and 45nm nodes, ultra-thin-body fully-depleted MOSFETs for 32nm technology, and double-gate FinFETs for the 20nm node.

We model multicore processors running server workloads (i.e., TPC-C, TPC-H and SPECweb) with cores built in one of three ways: general purpose (GPP), embedded (EMB), or specialized (Ideal-P). GPP cores are similar to the cores in Sun UltraSPARC T1. We model 4-way multi-threaded scalar in-order cores, as similar cores have been shown to optimize performance for server workloads [2]. Because general-purpose cores consume an inordinate amount of power and area compared to embedded cores, we also evaluate cores similar to the ones in ARM11 MPCore. To evaluate the potential of heterogeneous multicores running server workloads, we also study cores with ASIC-like properties: Ideal-P cores deliver 20x the performance of a GPP core on 1/8th the power under control-intensive workloads [3].
A heterogeneous multicore processor will enable only the Ideal-P cores that most closely match the requirements of the available work, and use GPP cores for non-critical or complex/uncommon parts of the program, thereby exhibiting near-ASIC properties.
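
The constrained search described in this section can be illustrated with a minimal Python sketch. It is not the model used in this study: the miss-ratio curve, the cache area and leakage coefficients, the bandwidth coefficient, the chip budget, and the per-core area and power values are all hypothetical placeholders chosen only so that the example runs. The only figures taken from the text are the Ideal-P ratios (20x the performance of a GPP core on 1/8th the power). The sketch enumerates core counts and cache sizes, discards designs that violate the area, power, or off-chip bandwidth budget, and keeps the highest-throughput design that remains.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CoreType:
    name: str
    perf: float      # per-core throughput relative to one GPP core
    power_w: float   # per-core power in watts (placeholder value)
    area_mm2: float  # per-core area in mm^2 (placeholder value)


@dataclass(frozen=True)
class Budget:
    area_mm2: float
    power_w: float
    bandwidth_gbs: float


# Placeholder constants (assumptions, not taken from the paper or ITRS).
CACHE_AREA_MM2_PER_MB = 5.0
CACHE_LEAKAGE_W_PER_MB = 0.5
MISS_PENALTY = 10.0        # weight of the miss ratio in the slowdown term
GBS_PER_MISS_UNIT = 50.0   # off-chip GB/s per unit of (throughput * miss ratio)


def miss_ratio(cache_mb: float) -> float:
    """Toy miss curve (assumption): misses fall with the square root of cache size."""
    return 0.05 * cache_mb ** -0.5


def evaluate(core: CoreType, n_cores: int, cache_mb: float,
             budget: Budget) -> Optional[float]:
    """Return chip throughput, or None if any physical constraint is violated."""
    m = miss_ratio(cache_mb)
    throughput = n_cores * core.perf / (1.0 + MISS_PENALTY * m)
    area = n_cores * core.area_mm2 + cache_mb * CACHE_AREA_MM2_PER_MB
    power = n_cores * core.power_w + cache_mb * CACHE_LEAKAGE_W_PER_MB
    bandwidth = throughput * m * GBS_PER_MISS_UNIT
    if (area > budget.area_mm2 or power > budget.power_w
            or bandwidth > budget.bandwidth_gbs):
        return None
    return throughput


def best_design(core: CoreType, budget: Budget, max_cores: int = 64,
                cache_sizes: Sequence[int] = (1, 2, 4, 8, 16, 32),
                ) -> Optional[Tuple[float, int, int]]:
    """Exhaustively search (core count, cache size) for the peak feasible design."""
    best = None
    for n in range(1, max_cores + 1):
        for c in cache_sizes:
            t = evaluate(core, n, c, budget)
            if t is not None and (best is None or t > best[0]):
                best = (t, n, c)
    return best


def perf_per_watt(core: CoreType) -> float:
    return core.perf / core.power_w


# Ideal-P: 20x GPP performance on 1/8th the power (ratios from the text).
# The absolute watt and mm^2 values, and the equal areas, are placeholders.
GPP = CoreType("GPP", perf=1.0, power_w=8.0, area_mm2=10.0)
IDEAL_P = CoreType("Ideal-P", perf=20.0, power_w=1.0, area_mm2=10.0)

if __name__ == "__main__":
    budget = Budget(area_mm2=300.0, power_w=100.0, bandwidth_gbs=50.0)
    for core in (GPP, IDEAL_P):
        print(core.name, best_design(core, budget))
    # 20x performance at 1/8th power implies 20 * 8 = 160x performance per watt.
    print(perf_per_watt(IDEAL_P) / perf_per_watt(GPP))
```

Exhaustive enumeration is used here only because the toy search space is tiny; the sketch says nothing about how the actual algorithm traverses the space, and it omits the voltage, frequency, and technology-node parameters that the study optimizes jointly.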