RackMem: A Tailored Caching Layer for Rack Scale Computing

Changyeon Jo, Hyunik Kim, Hexiang Geng, Bernhard Egger
Seoul National University, Seoul, Republic of Korea
{changyeon,hyunik,hexiang,bernhard}@csap.snu.ac.kr

ABSTRACT

High-performance computing (HPC) clusters suffer from an overall low memory utilization that is caused by node-centric memory allocation combined with the variable memory requirements of HPC workloads. The recent provisioning of nodes with terabytes of memory to accommodate workloads with extreme peak memory requirements further exacerbates the problem. Memory disaggregation is viewed as a promising remedy to increase overall resource utilization and enable cost-effective up-scaling and efficient operation of HPC clusters; however, the overhead of demand paging in virtual memory management has so far hindered performant implementations. To overcome these limitations, this work presents RackMem, an efficient implementation of disaggregated memory for rack scale computing. RackMem addresses the shortcomings of Linux's demand paging algorithm and automatically adapts to the memory access patterns of individual processes to minimize the inherent overhead of remote memory accesses. Evaluated on a cluster with an Infiniband interconnect, RackMem outperforms the state-of-the-art RDMA implementation and Linux's virtual memory paging by a significant margin. RackMem's custom demand paging implementation achieves a tail latency that is two orders of magnitude better than that of the Linux kernel. Compared to the state-of-the-art remote paging solution, RackMem achieves a 28% higher throughput and a 44% lower tail latency for a wide variety of real-world workloads.

CCS CONCEPTS

· Computer systems organization → Cloud computing; · Software and its engineering → Virtual memory; Distributed memory; Cloud computing.
KEYWORDS

Resource disaggregation; Remote memory; High-performance computing; Virtualization

ACM Reference Format:
Changyeon Jo, Hyunik Kim, Hexiang Geng, Bernhard Egger. 2020. RackMem: A Tailored Caching Layer for Rack Scale Computing. In Proceedings of the 2020 International Conference on Parallel Architectures and Compilation Techniques (PACT '20), October 3–7, 2020, Virtual Event, GA, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3410463.3414643

© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8075-1/20/10. $15.00

Figure 1: Average memory per core of the fastest 50 systems¹ from the TOP500 list (June 2020) [51].

1 INTRODUCTION

Recent years have brought an increasing demand for applications in the parallel and high-performance computing (HPC) domain. Applications not only comprise core computer science workloads such as in-memory databases or machine learning, but extend across a broad range of science and engineering disciplines such as bioinformatics, climate science, material science, and high-energy physics. The memory system of an HPC cluster plays an important role in accommodating these workloads in terms of performance, cost, and energy consumption [54].
Studies analyzing the memory utilization of HPC clusters find three important characteristics: first, HPC workloads exhibit a bimodal distribution in memory capacity requirements, with several workloads requiring over 4 GB per core [54]. Second, the memory usage over time across the nodes in an HPC system shows a large variation, ranging from a few hundreds of megabytes up to tens of gigabytes [32]. Third, the working set of a workload is typically significantly smaller than its peak memory requirements [9]. These characteristics make it difficult to determine the ‘optimal’ memory size in an HPC cluster (Figure 1), leading to low average utilization of the available resources [9, 20].

Recent advances in high-speed interconnects [38] allow for a paradigm shift away from isolated to disaggregated hardware resources [11, 31, 33, 42, 44, 50]. Sharing resources such as processors, memory, or storage over a fast network [1, 6, 24, 29, 30, 48] has the potential to improve resource utilization through flexible allocation that is not possible in server-centric architectures.

Memory disaggregation has been proposed to process big data and in-memory workloads on commodity servers with moderate amounts of physical memory [2, 20]. Disaggregating a low-latency, high-throughput resource such as DRAM over a network, however, is a challenging task. Despite improvements in fast optical networks, access latency and throughput are still one to two orders of magnitude below that of local memory [2, 38]. Exploiting the principle of locality, existing implementations employ the local memory as a cache for remote memory. Remote memory is

¹ The fastest 50 systems providing memory information.
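The local-memory-as-cache principle mentioned above can be illustrated with a small toy model: a fixed number of local page frames caches pages that logically reside in remote memory, a miss triggers a demand-page fetch over the (simulated) network, and a full cache evicts the least recently used page back to the remote store. This is only an illustrative sketch of the general scheme; the class and method names are hypothetical and do not reflect RackMem's actual implementation.

```python
from collections import OrderedDict

class RemotePager:
    """Toy model of local memory acting as a cache for remote memory.

    All names are illustrative, not RackMem's API. `local_frames` plays
    the role of the node's local DRAM; evicted pages are written back to
    the remote backing store, mimicking remote paging over the network.
    """

    def __init__(self, local_frames):
        self.local_frames = local_frames
        self.local = OrderedDict()   # page -> data, ordered by recency
        self.remote = {}             # backing store for evicted pages
        self.faults = 0              # demand-paging events (remote fetches)

    def access(self, page):
        if page in self.local:               # hit: local-memory latency
            self.local.move_to_end(page)
            return self.local[page]
        self.faults += 1                     # miss: fetch page from remote
        data = self.remote.pop(page, bytes(4096))
        if len(self.local) >= self.local_frames:
            victim, vdata = self.local.popitem(last=False)  # evict LRU page
            self.remote[victim] = vdata      # write it back to remote memory
        self.local[page] = data
        return data

pager = RemotePager(local_frames=2)
for p in [0, 1, 0, 2, 0, 1]:   # page 1 is evicted by page 2, then refaulted
    pager.access(p)
print(pager.faults)            # → 4
```

The model makes the locality argument concrete: as long as the working set (pages 0 and 2 here, plus the recency-favored page 0) fits in local frames, most accesses hit at local-memory speed and only the cold misses pay the remote-access latency.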