Architectural Support for Efficient Data Movement in Fully Disaggregated Systems

Christina Giannoula, University of Toronto & National Technical University of Athens, christina.giann@gmail.com
Kailong Huang*, University of Toronto, kailong9@gmail.com
Jonathan Tang*, University of Toronto, jonathanj.tang@alum.utoronto.ca
Nectarios Koziris, National Technical University of Athens, nkoziris@cslab.ece.ntua.gr
Georgios Goumas, National Technical University of Athens, goumas@cslab.ece.ntua.gr
Zeshan Chishti, Intel Corporation, zeshan.a.chishti@intel.com
Nandita Vijaykumar, University of Toronto, nandita@cs.toronto.edu

1 DATA MOVEMENT IN DISAGGREGATED SYSTEMS

Traditional data centers include monolithic servers that tightly integrate CPU, memory, and disk (Figure 1a). Instead, Disaggregated Systems (DSs) [8, 13, 18, 27] organize multiple compute (CC), memory (MC), and storage devices as independent, failure-isolated components interconnected over a high-bandwidth network (Figure 1b). DSs can greatly reduce data center costs by providing improved resource utilization, resource scaling, failure handling, and elasticity in modern data centers [5, 8–11, 13, 18, 27].

Figure 1: (a) Traditional systems vs. (b) DSs.

The MCs provide large pools of main memory (remote memory), while the CCs include the on-chip caches and a few GBs of DRAM (local memory) that acts as a cache of remote memory. In this context, a large fraction of the application's data (80%) [8, 18, 27] is located in remote memory, and remotely accessing it over the network can cause large performance penalties. Alleviating data access overheads in DSs is challenging for the following reasons.
First, DSs are not monolithic and comprise independently managed entities: each component has its own hardware controller, and a specialized kernel monitor uses its own functionality to manage the component it runs on (it communicates with other monitors via network messaging only when remote resources need to be accessed). This characteristic necessitates a distributed and disaggregated solution that can scale to a large number of independent components in the system. Second, there is high variability in remote memory access latencies, since they depend on the locations of the MCs, contention from other jobs that share the same network and MCs, and data placements that can vary during runtime or between multiple executions. This necessitates a solution that is robust to fluctuations in network/remote memory bandwidth and latencies. Third, a major factor behind the performance slowdowns is the commonly-used approach in DSs [5, 8, 18, 27] of moving data at page granularity. This approach effectively provides software transparency, low metadata costs in memory management, and high spatial locality in many applications. However, it can cause high bandwidth consumption and network congestion, and often significantly slows down accesses to critical-path cache lines in other concurrently accessed pages.

* Kailong Huang and Jonathan Tang contributed equally.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGMETRICS '23 Abstracts, June 19–23, 2023, Orlando, FL, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0074-3/23/06.
https://doi.org/10.1145/3578338.3593533
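To build intuition for the bandwidth amplification of page-granularity movement, the following back-of-the-envelope sketch compares network traffic when each remote miss fetches a full page versus a single cache line. The sizes and miss count are illustrative assumptions, not measurements from this work.

```python
# Illustrative (assumed) parameters: a 4 KB page vs. a 64 B cache line.
# Numbers are for intuition only, not measurements from the paper.
PAGE_BYTES = 4096
LINE_BYTES = 64

def bytes_moved(num_misses: int, granularity: int) -> int:
    """Bytes crossing the network if each miss fetches `granularity` bytes."""
    return num_misses * granularity

# Suppose an application touches 1000 distinct cache lines, each on a
# different remote page (i.e., poor spatial locality).
misses = 1000
page_traffic = bytes_moved(misses, PAGE_BYTES)
line_traffic = bytes_moved(misses, LINE_BYTES)
print(f"amplification: {page_traffic // line_traffic}x")  # → amplification: 64x
```

Under good spatial locality the extra page bytes are eventually useful, which is why page granularity remains attractive; the 64x worst case above is what congests the network when locality is poor.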
2 PRIOR WORK

Prior works [2–5, 8, 12–14, 18, 19, 24, 27, 28, 30] propose OS kernels, system-level solutions, software management systems, and architectures for DSs. These works do not tackle the data movement challenge in DSs, and thus our work is orthogonal to them. Prior works on hybrid systems [1, 6, 7, 15, 17, 20–23, 25, 26, 29] integrate die-stacked DRAM [16] as a DRAM cache of a large main memory [1, 7, 15] in a monolithic server, and tackle high page movement costs in two-tiered physical memory via page placement/hot-page selection schemes or by moving data at a smaller granularity, e.g., cache line. However, data movement in DSs poses fundamentally different challenges. First, accesses across the network are significantly slower than accesses within the server; thus, intelligent page placement cannot by itself address these high costs. Second, DSs incur significant variations in access latencies based on the current network architecture and the concurrent jobs sharing the MCs/network, thus necessitating a solution primarily designed for robustness to this variability. Finally, DSs include independently managed MCs and networks shared by independent CCs running unknown jobs. Thus, unlike hybrid systems, the solution cannot assume that the memory management at the MCs can be fully controlled by the CPU side. Our work is the first to examine the data movement problem in fully DSs and to design an effective solution for them.

3 DAEMON'S KEY IDEAS

DaeMon (Figure 2) is an adaptive and scalable mechanism to alleviate data movement costs in DSs, consisting of three techniques.

(I) Decoupled Multiple Granularity Data Movement. We integrate two separate hardware queues to serve data requests from remote memory at two granularities, i.e., cache line (via the sub-block queue to the LLC) and page (via the page queue to local memory).
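The decoupling of the two queues can be illustrated with a minimal software sketch. This is not DaeMon's actual hardware design: the class, method names, and request format are hypothetical, and the sketch only shows the key property that critical-path cache-line requests are queued and served independently of bulk page transfers.

```python
from collections import deque

class DecoupledMover:
    """Sketch (assumed structure) of decoupled dual-granularity movement:
    each remote miss enqueues a 64 B sub-block request (toward the LLC)
    and a page request (toward local memory) in separate queues."""

    def __init__(self):
        self.subblock_queue = deque()  # cache-line requests -> LLC
        self.page_queue = deque()      # page requests -> local memory

    def miss(self, page_addr: int, line_offset: int) -> None:
        # The two granularities are enqueued independently, so the
        # critical-path line is never blocked behind earlier page transfers.
        self.subblock_queue.append((page_addr, line_offset))
        self.page_queue.append(page_addr)

    def drain_subblocks(self):
        """Serve all pending cache-line requests first (critical path);
        page transfers in page_queue proceed in the background."""
        served = []
        while self.subblock_queue:
            served.append(self.subblock_queue.popleft())
        return served

m = DecoupledMover()
m.miss(page_addr=0x1000, line_offset=3)
m.miss(page_addr=0x2000, line_offset=0)
print(m.drain_subblocks())  # → [(4096, 3), (8192, 0)]
print(list(m.page_queue))   # → [4096, 8192]
```

The point of the separation is that a burst of page-granularity traffic cannot delay the small, latency-critical sub-block responses, which is exactly the slowdown the page-only approach suffers from.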