RackMem: A Tailored Caching Layer for Rack Scale Computing
Changyeon Jo, Hyunik Kim, Hexiang Geng, Bernhard Egger
Seoul National University
Seoul, Republic of Korea
{changyeon,hyunik,hexiang,bernhard}@csap.snu.ac.kr
ABSTRACT
High-performance computing (HPC) clusters suffer from an overall
low memory utilization that is caused by the node-centric memory
allocation combined with the variable memory requirements of
HPC workloads. The recent provisioning of nodes with terabytes
of memory to accommodate workloads with extreme peak memory
requirements further exacerbates the problem. Memory disaggrega-
tion is viewed as a promising remedy to increase overall resource
utilization and enable cost-efective up-scaling and efcient oper-
ation of HPC clusters, however, the overhead of demand paging
in virtual memory management has so far hindered performant
implementations. To overcome these limitations, this work presents
RackMem, an efficient implementation of disaggregated memory
for rack scale computing. RackMem addresses the shortcomings
of Linux’s demand paging algorithm and automatically adapts to
the memory access patterns of individual processes to minimize
the inherent overhead of remote memory accesses. Evaluated on a
cluster with an InfiniBand interconnect, RackMem outperforms the
state-of-the-art RDMA implementation and Linux’s virtual memory
paging by a significant margin. RackMem’s custom demand
paging implementation achieves a tail latency that is two orders of
magnitude better than that of the Linux kernel. Compared to the
state-of-the-art remote paging solution, RackMem achieves a 28%
higher throughput and a 44% lower tail latency for a wide variety
of real-world workloads.
CCS CONCEPTS
· Computer systems organization → Cloud computing; · Soft-
ware and its engineering → Virtual memory; Distributed mem-
ory; Cloud computing.
KEYWORDS
Resource disaggregation; Remote memory; High-performance com-
puting; Virtualization
ACM Reference Format:
Changyeon Jo, Hyunik Kim, Hexiang Geng, Bernhard Egger. 2020. Rack-
Mem: A Tailored Caching Layer for Rack Scale Computing. In Proceedings of
the 2020 International Conference on Parallel Architectures and Compilation
Techniques (PACT ’20), October 3–7, 2020, Virtual Event, GA, USA. ACM, New
York, NY, USA, 14 pages. https://doi.org/10.1145/3410463.3414643
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
PACT ’20, October 3–7, 2020, Virtual Event, GA, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8075-1/20/10. . . $15.00
https://doi.org/10.1145/3410463.3414643
Figure 1: Average memory per core of the fastest 50 systems¹ from the TOP500 list (June 2020) [51].
1 INTRODUCTION
Recent years have brought an increasing demand for applications
in the parallel and high-performance computing (HPC) domain. Ap-
plications not only comprise core computer science workloads such
as in-memory databases or machine learning, but extend across a
broad range of science and engineering disciplines such as bioinfor-
matics, climate science, material science, and high-energy physics.
The memory system of an HPC cluster plays an important role
in accommodating these workloads in terms of performance, cost,
and energy consumption [54]. Studies analyzing the memory utilization
of HPC clusters find three important characteristics: first,
HPC workloads exhibit a bimodal distribution in memory capac-
ity requirements with several workloads requiring over 4 GB per
core [54]. Second, the memory usage over time across the nodes in
an HPC system shows a large variation ranging from a few hun-
dreds of megabytes up to tens of gigabytes [32]. Third, the working
set of a workload is typically significantly smaller than its peak
memory requirements [9]. These characteristics make it difficult to
determine the ‘optimal’ memory size in an HPC cluster (Figure 1),
leading to low average utilization of the available resources [9, 20].
Recent advances in high-speed interconnects [38] allow for a
paradigm shift away from isolated to disaggregated hardware re-
sources [11, 31, 33, 42, 44, 50]. Sharing resources such as processors,
memory, or storage over a fast network [1, 6, 24, 29, 30, 48] has the
potential to improve resource utilization through fexible allocation
that is not possible in server-centric architectures.
Memory disaggregation has been proposed to process big data
and in-memory workloads on commodity servers with moderate
amounts of physical memory [2, 20]. Disaggregating a low-latency
high-throughput resource such as DRAM over a network, how-
ever, is a challenging task. Despite improvements in fast optical
networks, access latency and throughput are still one to two or-
ders of magnitude below that of local memory [2, 38]. Exploiting
the principle of locality, existing implementations employ the lo-
cal memory as a cache for remote memory. Remote memory is
¹The fastest 50 systems providing memory information.