JETC: Joint Energy, Thermal and Cooling Management for Memory and CPU Subsystems in Servers

Raid Ayoub, Rajib Nath, Tajana Rosing
University of California, San Diego
La Jolla, CA 92093-0404

Abstract

In this work we propose a joint energy, thermal and cooling management technique (JETC) that significantly reduces per-server cooling and memory energy costs. Our analysis shows that decoupling the optimization of cooling energy of CPU & memory from the optimization of memory energy leads to suboptimal solutions, due to thermal dependencies between CPU and memory and the non-linearity of cooling energy. This motivates us to develop a holistic solution that integrates energy, thermal and cooling management to maximize energy savings with a negligible performance hit. JETC considers the thermal and power states of CPU & memory, the thermal coupling between them, and the fan speed to arrive at energy-efficient decisions. It has CPU and memory actuators to implement its decisions. The memory actuator reduces memory energy by performing cooling-aware clustering of memory pages onto a subset of memory modules. The CPU actuator saves cooling energy by reducing hot spots between and within the CPU sockets and minimizing the effects of thermal coupling. Our experimental results show that employing JETC results in a 50.7% average energy reduction in the cooling and memory subsystems with less than 0.3% performance overhead.

1. Introduction

Technology scaling coupled with the high demand for computation has led to the wide use of server systems with multiple CPU sockets and larger amounts of memory, resulting in higher power densities [1, 2]. High power dissipation increases the operational costs of machines. It also causes thermal hot spots that have a substantial effect on reliability, performance and leakage power [25, 7]. Dissipating the excess heat is a big challenge as it requires a complex and energy-hungry cooling subsystem.
Traditionally, the CPU has been the primary source of system power consumption. Over the years, designers have developed techniques to improve the energy efficiency of the CPU subsystem, which accounts for approximately 50% of the total energy budget. Less attention has been given to energy optimization in the rest of the system components, which leads to poor energy proportionality of the entire system [11].

The memory subsystem is the other major power-hungry component in server systems, as it consumes up to 35% of total system energy and has poor energy proportionality [10, 11]. Capacity and bandwidth of the memory subsystem are typically designed to handle worst-case scenarios, yet applications vary significantly in their memory access rates. It is common that only a fraction of memory pages is active while the rest are dormant. One solution is to activate only the subset of memory modules that can serve the applications' needs, saving energy [18]. However, this approach increases the power density of the active memory modules, which can cause thermal problems in the memory. In [21, 22], the authors proposed a solution to mitigate thermal emergencies in the memory system by adjusting memory throughput to keep the memory temperature in the safe zone. However, this solution does not improve energy proportionality in the memory subsystem since it does not consider minimizing the number of active DIMMs to just what is needed.

To manage thermal emergencies within a single CPU socket, a number of core-level dynamic thermal management (DTM) techniques have been proposed [15, 30]. The benefits of these techniques are limited to managing the temperature within a single socket. The important issue is that none of the existing techniques consider the dynamics of the cooling subsystem and its energy costs. Modern servers incorporate a fan subsystem to reduce their temperature.
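The module-subset idea discussed above can be illustrated with a minimal sketch: given the number of active pages, compute the smallest number of DIMMs that can hold them so the remaining DIMMs may enter a low-power state. This is only an illustration of the consolidation principle, not JETC's actual cooling-aware policy; the function and constant names here are hypothetical.

```python
def active_dimms_needed(active_pages: int, dimm_capacity_pages: int,
                        num_dimms: int) -> int:
    """Smallest number of DIMMs that can hold the active pages.

    The remaining (num_dimms - result) DIMMs could be placed in a
    low-power state. Hypothetical helper, not part of JETC itself.
    """
    needed = -(-active_pages // dimm_capacity_pages)  # ceiling division
    return min(needed, num_dimms)

# Example: 300,000 active 4 KB pages, 262,144 pages (1 GB) per DIMM,
# 8 DIMMs installed -> only 2 DIMMs must stay active, 6 can idle.
active = active_dimms_needed(300_000, 262_144, 8)
idle = 8 - active
```

Note that consolidating pages this way raises the power density of the active DIMMs, which is exactly the thermal side effect the text above points out; a cooling-aware policy must weigh that against the energy saved on the idled modules.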
The power consumed by a fan is cubically related to the air flow rate, which makes it energy hungry [24]. The fan system in high-end servers consumes as much as 80 Watts in 1U rack servers [24] and 240 Watts or more in 2U rack servers [1]. The cooling subsystem becomes inefficient when it operates far from the optimal point due to poor thermal distribution. Workload allocation policies need to account for cooling characteristics to minimize cooling energy by creating a better thermal distribution. Due to cost and area constraints, a common set of