Exploring Small-Scale and Large-Scale CMP Architectures for Commercial Java Servers

Abstract

As we enter the era of chip multiprocessor (CMP) architectures, it is important that we explore the scaling characteristics of mainstream server workloads on these platforms. In this paper, we analyze the performance of an Enterprise Java workload (SPECjbb2005) on two important classes of CMP architectures. One class of CMP platforms comprises “small-scale” CMP (SCMP) processors with a few large out-of-order cores on the die. The other class comprises “large-scale” CMP (LCMP) processor(s) with several small in-order cores on the die. For these classes of CMP architectures to succeed, it is important that there are sufficient resources (cache, memory and interconnect) to allow for a balanced, scalable platform. In this paper, we focus on evaluating the resource scaling characteristics (cores, caches and memory) of SPECjbb2005 on these two architectures and on understanding the architectural trade-offs that may be required in future CMP offerings. The overall evaluation is uniquely conducted using four different methodologies (measurements on the latest platforms, trace-based cache simulation, trace-based platform simulation and execution-driven emulation). Based on our findings, we summarize the architectural recommendations for future CMP server platforms (e.g. the need for large DRAM caches).

1. INTRODUCTION

CMP architectures [25] have become the norm for client and server platforms as these platforms [1, 10, 11, 17, 18] enter the marketplace and gain widespread adoption. The existing CMP offerings differ in the number of cores, the cache/memory architecture and the interconnect choices. As a result, it is very difficult to understand the architectural trade-offs currently being made.
Without a detailed evaluation of client and server workload characteristics, it is also difficult to grasp how CMP platforms will evolve in the future and what trade-offs will need to be considered as the CMP architecture space matures. In this paper, our goal is to address this issue by evaluating the performance of a commercial server workload on current and future CMP architectures. By doing so, we hope to identify critical challenges in future CMP architectures as well as point to potential emerging solutions.

In the server space, there are two major classes of CMP architectures: small-scale (SCMP) and large-scale (LCMP) architectures. SCMP platforms consist of “small-scale” CMP processors with a few large out-of-order cores on the die. Recent offerings of SCMP platforms are described in [1] and [10]. LCMP platforms consist of “large-scale” CMP processor(s) with several small in-order cores on the die. These platforms target the throughput computing segment, where recent offerings such as Niagara [17] and Azul [2] are prime examples. The successful evolution of these CMP architectures depends not only on the ability to integrate more cores, but also heavily on the available platform resources (cache, memory and interconnect). For example, the current SCMP server platforms have large caches since there are only two cores on the die (e.g. Intel® Core™ 2 Duo processor [10]), whereas the existing LCMP platform has a small shared cache since the die space is occupied by eight cores (e.g. Sun’s Niagara [17]). As a result, the LCMP platform was required to support higher memory bandwidth and a lower number of sockets. In this paper, we will study the resource scaling characteristics of both SCMP and LCMP architectures using a commercial server workload.

Our analysis of SCMP and LCMP architectures is based on the SPECjbb2005 benchmark [27].
We chose SPECjbb2005 since managed runtime applications based on Java are increasingly used in the server domain and since it is heavily used by CPU and platform developers to understand the performance of commercial servers. Some of the recent studies using SPECjbb [26, 27] include [6, 15, 21, 22, 23, 24]. While several studies have used SPECjbb2000 (and a handful with SPECjbb2005) for architectural evaluation, this paper is the first to study the performance of SPECjbb2005 on both small-scale and large-scale CMP architectures. In order to accomplish this, we took advantage of four different methodologies: (1) measurements on a state-of-the-art Intel server platform, (2) trace-based functional cache simulations, (3) trace-based CPU and platform simulations and (4) execution-driven emulation using an FPGA prototype. In this paper, we will describe how these methodologies allowed us to cover the SCMP/LCMP design space effectively.

R. Iyer, M. Bhat*, L. Zhao, R. Illikkal, S. Makineni, M. Jones*, K. Shiv*, D. Newell
Systems Technology Laboratory, Intel Corporation
*Software Solutions Group, Intel Corporation

Overall, the primary contribution of this paper is the detailed resource scaling study of SPECjbb2005 on SCMP and LCMP architectures (including core scaling effects, cache scaling effects, memory scaling effects and the potential benefits of emerging bandwidth/latency