Architectural Support for Efficient Data Movement
in Fully Disaggregated Systems
Christina Giannoula
University of Toronto & National Technical University of Athens
christina.giann@gmail.com

Kailong Huang*
University of Toronto
kailong9@gmail.com

Jonathan Tang*
University of Toronto
jonathanj.tang@alum.utoronto.ca

Nectarios Koziris
National Technical University of Athens
nkoziris@cslab.ece.ntua.gr

Georgios Goumas
National Technical University of Athens
goumas@cslab.ece.ntua.gr

Zeshan Chishti
Intel Corporation
zeshan.a.chishti@intel.com

Nandita Vijaykumar
University of Toronto
nandita@cs.toronto.edu
1 DATA MOVEMENT IN DISAGGREGATED SYSTEMS
Traditional data centers include monolithic servers that tightly integrate CPU, memory and disk (Figure 1a). In contrast, Disaggregated Systems (DSs) [8, 13, 18, 27] organize multiple compute (CC), memory (MC) and storage devices as independent, failure-isolated components interconnected over a high-bandwidth network (Figure 1b). DSs can greatly reduce data center costs by providing improved resource utilization, resource scaling, failure handling and elasticity in modern data centers [5, 8–11, 13, 18, 27].
Figure 1: (a) Traditional systems vs (b) DSs. (a) Monolithic servers, each with a CPU and local memory, connected by a network across servers. (b) Compute components (CPU, local memory, processor monitor), memory components (remote memory controller, memory monitor) and storage components (disk controller, disk monitor), connected by a network across hardware components.
The MCs provide large pools of main memory (remote memory), while the CCs include the on-chip caches and a few GBs of DRAM (local memory) that acts as a cache of remote memory. In this context, a large fraction of the application's data (∼80%) [8, 18, 27] resides in remote memory, which can cause large performance penalties from remotely accessing data over the network.
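A back-of-envelope average-access-time model illustrates why remote placement dominates performance. All latency values below are illustrative assumptions for the sketch, not measurements from this work:

```python
# Back-of-envelope model of average memory access time in a DS.
# All latency numbers are illustrative assumptions, not measured values.

def avg_access_latency_ns(remote_fraction, local_hit_ns, remote_ns):
    """Expected latency when a fraction of the data resides in remote MCs."""
    return (1 - remote_fraction) * local_hit_ns + remote_fraction * remote_ns

local_hit_ns = 100   # assumed local DRAM access
remote_ns = 3000     # assumed network round trip to a remote MC

# With ~80% of application data in remote memory, as reported for DSs,
# the expected access time is dominated by the remote term:
print(avg_access_latency_ns(0.80, local_hit_ns, remote_ns))  # ~2420 ns
```

Under these assumed numbers, the average access is over 20x slower than a purely local access, which is why remote data movement is the central bottleneck.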
Alleviating data access overheads is challenging in DSs for the following reasons. First, DSs are not monolithic and comprise independently managed entities: each component has its own hardware controller, and a specialized kernel monitor manages the component it runs on, communicating with other monitors via network messages only when remote resources need to be accessed. This characteristic necessitates a distributed and
disaggregated solution that can scale to a large number of independent components in the system. Second, there is high variability in remote memory access latencies, since they depend on the locations of the MCs, on contention from other jobs that share the same network and MCs, and on data placements that can change during runtime or across executions. This necessitates a solution that is robust to fluctuations in network/remote memory bandwidth and latencies. Third, a major factor behind the performance slowdowns is the commonly-used approach in DSs [5, 8, 18, 27] of moving data at page granularity. This approach effectively provides software transparency, low metadata costs in memory management, and high spatial locality in many applications. However, it can cause high bandwidth consumption and network congestion, and often significantly slows down accesses to critical-path cache lines in other concurrently accessed pages.

* Kailong Huang and Jonathan Tang contributed equally.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGMETRICS ’23 Abstracts, June 19–23, 2023, Orlando, FL, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0074-3/23/06.
https://doi.org/10.1145/3578338.3593533
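The bandwidth cost of page-granularity movement can be illustrated with a small sketch. The page and cache-line sizes below are common defaults, assumed here only for illustration:

```python
# Bandwidth amplification of page-granularity data movement:
# fetching a whole 4 KB page to serve misses on only a few 64 B lines.

PAGE_BYTES = 4096   # typical OS page size (assumption)
LINE_BYTES = 64     # typical cache-line size (assumption)

def amplification(lines_actually_used):
    """Bytes moved over the network per byte actually used by the CC."""
    return PAGE_BYTES / (lines_actually_used * LINE_BYTES)

print(amplification(1))   # 64.0: one line used -> 64x the needed traffic
print(amplification(64))  # 1.0: full page used -> no waste (high locality)
```

When spatial locality is low, a single critical-path miss drags a full page across the shared network, which is exactly the congestion effect described above.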
2 PRIOR WORK
Prior works [2–5, 8, 12–14, 18, 19, 24, 27, 28, 30] propose OS kernels, system-level solutions, software management systems, and architectures for DSs. These works do not tackle the data movement challenge in DSs, and thus our work is orthogonal to them.
Prior works on hybrid systems [1, 6, 7, 15, 17, 20–23, 25, 26, 29]
integrate die-stacked DRAM [16] as DRAM cache of a large main
memory [1, 7, 15] in a monolithic server, and tackle high page move-
ment costs in two-tiered physical memory via page placement/hot
page selection schemes or by moving data at smaller granularity,
e.g., cache line. However, data movement in DSs poses fundamen-
tally different challenges. First, accesses across the network are
significantly slower than within the server, thus intelligent page
placement cannot by itself address these high costs. Second, DSs
incur significant variations in access latencies based on the current
network architecture and concurrent jobs sharing the MCs/network,
thus necessitating a solution primarily designed for robustness to this variability. Finally, DSs include independently managed MCs
and networks shared by independent CCs running unknown jobs.
Thus, unlike hybrid systems, the solution cannot assume that the
memory management at the MCs can be fully controlled by the
CPU side. Our work is the first to examine the data movement
problem in fully DSs, and design an effective solution for DSs.
3 DAEMON’S KEY IDEAS
DaeMon (Figure 2) is an adaptive and scalable mechanism to alleviate data movement costs in DSs, consisting of three techniques.
(I) Decoupled Multiple Granularity Data Movement. We integrate two separate hardware queues to serve data requests from remote memory at two granularities, i.e., cache line (via the sub-block queue to the LLC) and page (via the page queue to local memory).
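A minimal software sketch of the decoupled two-queue idea follows. The queue names and granularities mirror the text; the class, its methods, and the address arithmetic are a simplification for illustration, not DaeMon's actual hardware logic:

```python
from collections import deque

# Simplified model of decoupled multiple-granularity data movement:
# a critical cache line is served through a fast sub-block queue, while
# the enclosing page is fetched independently through the page queue.
# Illustrative sketch only, not DaeMon's hardware implementation.

CACHE_LINE = 64
PAGE = 4096

class RemoteRequestEngine:
    def __init__(self):
        self.subblock_queue = deque()  # cache-line requests -> LLC
        self.page_queue = deque()      # page requests -> local memory

    def miss(self, addr):
        line = addr - addr % CACHE_LINE   # align down to the cache line
        page = addr - addr % PAGE         # align down to the page
        self.subblock_queue.append(line)  # small, latency-critical transfer
        self.page_queue.append(page)      # bulk transfer off the critical path

engine = RemoteRequestEngine()
engine.miss(0x12345)
print(hex(engine.subblock_queue[0]))  # 0x12340 (line-aligned)
print(hex(engine.page_queue[0]))      # 0x12000 (page-aligned)
```

Decoupling the two queues lets the 64 B critical line bypass the much larger page transfer instead of waiting behind it.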