Flexible Architectural Support for Fine-Grain Scheduling

Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis
Electrical Engineering Department, Stanford University
{sanchezd,rmyoo,kozyraki}@stanford.edu

Abstract

To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without compromising locality, and with only small overheads. Software-only schedulers can implement various scheduling algorithms that match the characteristics of different applications and programming models, but suffer significant overheads as they synchronize and communicate task information over the deep cache hierarchy of a large-scale CMP. To reduce these costs, hardware-only schedulers like Carbon, which implement task queuing and scheduling in hardware, have been proposed. However, a hardware-only solution fixes the scheduling algorithm and leaves no room for other uses of the custom hardware.

This paper presents a combined hardware-software approach to build fine-grain schedulers that retain the flexibility of software schedulers while being as fast and scalable as hardware ones. We propose asynchronous direct messages (ADM), a simple architectural extension that provides direct exchange of asynchronous, short messages between threads in the CMP without going through the memory hierarchy. ADM is sufficient to implement a family of novel, software-mostly schedulers that rely on low-overhead messaging to efficiently coordinate scheduling and transfer task information. These schedulers match and often exceed the performance and scalability of Carbon when using the same scheduling algorithm. When the ADM runtime tailors its scheduling algorithm to application characteristics, it outperforms Carbon by up to 70%.
Categories and Subject Descriptors  C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors); D.3.4 [Programming Languages]: Processors—Run-time environments

General Terms  Design, Performance, Algorithms

ASPLOS'10, March 13–17, 2010, Pittsburgh, Pennsylvania, USA.
Copyright © 2010 ACM 978-1-60558-839-1/10/03...$10.00

1. Introduction

Chip-multiprocessors (CMPs) are now the mainstream approach to turning the increasing transistor budgets provided by Moore's Law into performance improvements. General-purpose CMPs with tens of cores are already available [9, 43], and chips with hundreds of cores will be available in the near future [30].

To use these large-scale CMPs efficiently, a program needs to explicitly divide its work into concurrent tasks and distribute them for execution across the available cores. A key issue is the granularity of this partitioning. This work focuses on programs that use fine-grain parallelism, with tasks as small as a thousand cycles. Fine-grain parallelism has several advantages. First, it can expose more parallelism in many applications, and for some applications parallelism is more easily expressed under this model [33]. This is particularly important for CMPs with hundreds of cores, where parallelism becomes a precious resource [26].
Second, it gives the underlying runtime system much more freedom in distributing and reassigning work among cores in order to avoid load imbalance in irregular computations, to exploit constructive cache interference among certain tasks [18], or to adapt to environment changes such as cores becoming unavailable due to faults, thermal emergencies, or multiprogramming. On the other hand, fine-grain parallelism may introduce large overheads for representing and distributing small amounts of work, and if tasks are not assigned judiciously, locality across different tasks may be destroyed.

Fine-grain parallelism is already supported by several parallel programming models [16, 23, 29]. Their runtime systems typically implement task distribution through work-stealing [13]: each worker thread has a queue of ready-to-execute tasks, in which it enqueues new work and from which it dequeues work to run. When a thread runs out of tasks, it tries to steal tasks from another thread's queue. Although this technique works well in general, multiple studies have shown that many applications benefit from different algorithms in terms of the structure of queues, the order of scheduling and stealing, or the granularity of stealing [12, 21, 25]. In short, there is no single best fine-grain scheduler for all applications.

Software-only implementations of fine-grain schedulers for such programming models are flexible in terms of the algorithm used. However, they entail high overheads with fine-grain tasks, as queue operations, task stealing, and synchronization introduce communication and contention through the cache hierarchy of the CMP. The latency of a cache line transfer in CMPs with 64 or 128 cores is close to a hundred cycles, so a few such transfers can negate the benefits of parallel execution of fine-grain tasks. Such latencies will increase in large-scale CMPs, making fine-grain parallelism impractical.
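The work-stealing scheme described above can be sketched as follows. This is a minimal illustration only; the class and function names are our own rather than from any of the cited runtimes, and it uses a per-queue lock where production runtimes typically use lock-free deques:

```python
import random
import threading
from collections import deque

class Worker:
    """One worker thread's task queue in a work-stealing scheduler
    (illustrative sketch; a lock guards the deque for simplicity)."""
    def __init__(self):
        self.tasks = deque()
        self.lock = threading.Lock()

    def push(self, task):
        # Owner enqueues at the front: LIFO order keeps
        # recently created, cache-hot tasks local.
        with self.lock:
            self.tasks.appendleft(task)

    def pop(self):
        # Owner dequeues from the same end it pushes to.
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

    def steal(self):
        # Thieves take from the opposite end, getting the
        # oldest (and often largest) pending task.
        with self.lock:
            return self.tasks.pop() if self.tasks else None

def get_task(workers, me):
    """Dequeue locally; if the local queue is empty,
    try to steal from victims in random order."""
    task = workers[me].pop()
    if task is not None:
        return task
    victims = [i for i in range(len(workers)) if i != me]
    random.shuffle(victims)
    for v in victims:
        task = workers[v].steal()
        if task is not None:
            return task
    return None
```

As the surrounding text notes, runtimes diverge from this baseline in exactly the dimensions shown here: the structure of the queues, which end owners and thieves operate on, the choice of victim, and how many tasks are stolen at once.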
To mitigate this problem, Carbon [34] proposes a hardware-only alternative, with specialized hardware queues and a custom messaging protocol for enqueuing, dequeuing and distributing tasks across cores. Hardware implements task stealing and distribution in the background and enables applications with fine-grain parallelism to perform well on large-scale CMPs. A disadvantage of Carbon is that it introduces a non-trivial amount of custom hardware for the sole purpose of work-stealing. Ideally, we would like to minimize custom hardware structures and implement general primitives that have other uses. Moreover, Carbon fixes the scheduling algorithm in hardware, making it difficult to accelerate an application or programming model that requires a different algo-