Flexible Architectural Support for Fine-Grain Scheduling

Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis
Electrical Engineering Department, Stanford University
{sanchezd,rmyoo,kozyraki}@stanford.edu

Abstract

To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without compromising locality, and with only small overheads. Software-only schedulers can implement various scheduling algorithms that match the characteristics of different applications and programming models, but suffer significant overheads as they synchronize and communicate task information over the deep cache hierarchy of a large-scale CMP. To reduce these costs, hardware-only schedulers like Carbon, which implement task queuing and scheduling in hardware, have been proposed. However, a hardware-only solution fixes the scheduling algorithm and leaves no room for other uses of the custom hardware.

This paper presents a combined hardware-software approach to build fine-grain schedulers that retain the flexibility of software schedulers while being as fast and scalable as hardware ones. We propose asynchronous direct messages (ADM), a simple architectural extension that provides direct exchange of asynchronous, short messages between threads in the CMP without going through the memory hierarchy. ADM is sufficient to implement a family of novel, software-mostly schedulers that rely on low-overhead messaging to efficiently coordinate scheduling and transfer task information. These schedulers match and often exceed the performance and scalability of Carbon when using the same scheduling algorithm. When the ADM runtime tailors its scheduling algorithm to application characteristics, it outperforms Carbon by up to 70%.
Categories and Subject Descriptors  C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); D.3.4 [Programming Languages]: Processors—Run-time environments

General Terms  Design, Performance, Algorithms

ASPLOS'10, March 13–17, 2010, Pittsburgh, Pennsylvania, USA.
Copyright © 2010 ACM 978-1-60558-839-1/10/03...$10.00

1. Introduction

Chip-multiprocessors (CMPs) are now the mainstream approach to turning the increasing transistor budgets provided by Moore's Law into performance improvements. General-purpose CMPs with tens of cores are already available [9, 43], and chips with hundreds of cores will be available in the near future [30].

To use these large-scale CMPs efficiently, a program needs to explicitly divide its work into concurrent tasks and distribute them for execution across the available cores. A key issue is the granularity of this partitioning. This work focuses on programs that use fine-grain parallelism, with tasks as small as a thousand cycles. Fine-grain parallelism has several advantages. First, it can expose more parallelism in many applications, and for some applications parallelism is more easily expressed under this model [33]. This is particularly important for CMPs with hundreds of cores, where parallelism becomes a precious resource [26].
Second, it gives the underlying runtime system much more freedom in distributing and reassigning work among cores in order to avoid load imbalance in irregular computations, to exploit constructive cache interference among certain tasks [18], or to adapt to environment changes such as cores becoming unavailable due to faults, thermal emergencies, or multiprogramming. On the other hand, fine-grain parallelism may introduce large overheads for representing and distributing small amounts of work, and if tasks are not assigned judiciously, locality across different tasks may be destroyed.

Fine-grain parallelism is already supported by several parallel programming models [16, 23, 29]. Their runtime systems typically implement task distribution through work-stealing [13]: each worker thread has a queue of ready-to-execute tasks, in which it enqueues new work and from which it dequeues work to run. When a thread runs out of tasks, it tries to steal tasks from another thread's queue. Although this technique works well in general, multiple studies have shown that many applications benefit from different algorithms in terms of the structure of queues, the order of scheduling and stealing, or the granularity of stealing [12, 21, 25]. In short, there is no single best fine-grain scheduler for all applications.

Software-only implementations of fine-grain schedulers for such programming models are flexible in terms of the algorithm used. However, they entail high overheads with fine-grain tasks, as queue operations, task stealing, and synchronization introduce communication and contention through the cache hierarchy of the CMP. The latency of a cache line transfer in CMPs with 64 or 128 cores is close to a hundred cycles, so a few such transfers can negate the benefits of parallel execution of fine-grain tasks. Such latencies will increase in large-scale CMPs, making fine-grain parallelism impractical.
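The work-stealing scheme described above can be sketched as follows. This is a minimal illustration only; the class and function names are our own rather than from any of the cited runtimes, and it uses a per-queue lock where production runtimes typically use lock-free deques:

```python
import random
import threading
from collections import deque

class Worker:
    """One worker thread's task queue in a work-stealing scheduler
    (illustrative sketch; a lock guards the deque for simplicity)."""
    def __init__(self):
        self.tasks = deque()
        self.lock = threading.Lock()

    def push(self, task):
        # Owner enqueues at the front: LIFO order keeps
        # recently created, cache-hot tasks local.
        with self.lock:
            self.tasks.appendleft(task)

    def pop(self):
        # Owner dequeues from the same end it pushes to.
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

    def steal(self):
        # Thieves take from the opposite end, getting the
        # oldest (and often largest) pending task.
        with self.lock:
            return self.tasks.pop() if self.tasks else None

def get_task(workers, me):
    """Dequeue locally; if the local queue is empty,
    try to steal from victims in random order."""
    task = workers[me].pop()
    if task is not None:
        return task
    victims = [i for i in range(len(workers)) if i != me]
    random.shuffle(victims)
    for v in victims:
        task = workers[v].steal()
        if task is not None:
            return task
    return None
```

As the surrounding text notes, runtimes diverge from this baseline in exactly the dimensions shown here: the structure of the queues, which end owners and thieves operate on, the choice of victim, and how many tasks are stolen at once.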
To mitigate this problem, Carbon [34] proposes a hardware-only alternative, with specialized hardware queues and a custom messaging protocol for enqueuing, dequeuing and distributing tasks across cores. Hardware implements task stealing and distribution in the background and enables applications with fine-grain parallelism to perform well on large-scale CMPs. A disadvantage of Carbon is that it introduces a non-trivial amount of custom hardware for the sole purpose of work-stealing. Ideally, we would like to minimize custom hardware structures and implement general primitives that have other uses. Moreover, Carbon fixes the scheduling algorithm in hardware, making it difficult to accelerate an application or programming model that requires a different algo-