Multitasking Workload Scheduling on Flexible-Core Chip Multiprocessors

Divya P. Gulati
University of Texas at Austin
1 University Station C0500
Austin, Texas 78712
dpgulati@cs.utexas.edu

Changkyu Kim
Intel Corporation
3600 Juliette Ln, SC12-303
Santa Clara, California 95054
changkyu.kim@intel.com

Simha Sethumadhavan
Columbia University
1214 Amsterdam Ave.
New York, New York 10027
simha@cs.columbia.edu

Stephen W. Keckler
University of Texas at Austin
1 University Station C0500
Austin, Texas 78712
skeckler@cs.utexas.edu

Doug Burger
University of Texas at Austin
1 University Station C0500
Austin, Texas 78712
dburger@cs.utexas.edu

ABSTRACT
While technology trends have ushered in the age of chip multiprocessors (CMPs), a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs (SCMPs), in which all cores are identical; the cores in such designs range from a simple RISC processor to a complex out-of-order x86 processor. Some researchers have proposed asymmetric CMPs (ACMPs) consisting of multiple types of cores. Although the issue is less severe for ACMPs, the fixed nature of both architectures makes them vulnerable to mismatches between the granularity of the cores and the parallelism in the workload, which can cause inefficient execution. To remedy this weakness, recent research has proposed flexible-core CMPs (FCMPs), which can aggregate multiple small processing cores to form larger logical processors. FCMPs introduce a new resource allocation and scheduling problem: determining how many logical processors should be configured, how powerful each processor should be, and where and when each task should run. This paper introduces and motivates this problem, describes the challenges associated with it, and evaluates algorithms appropriate for multitasking on FCMPs. We also evaluate static-core CMPs of various configurations and compare them to FCMPs for various multitasking workloads.
Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management—multiprocessing/multiprogramming/multitasking, scheduling; C.1.2 [Processor Architectures]: Multiprocessors—parallel processors

General Terms
Algorithms, Experimentation, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PACT’08, October 25–29, 2008, Toronto, Ontario, Canada.
Copyright 2008 ACM 978-1-60558-282-5/08/10 ...$5.00.

1. INTRODUCTION
While technology trends have ushered in the age of chip multiprocessors (CMPs) and enabled designers to place an increasing number of cores on chip, a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs (SCMPs), in which all cores are identical; the cores in such designs range from a relatively simple RISC pipeline to a large and complex out-of-order x86 core. However, the concurrency characteristics of programs demonstrate substantial diversity. For example, the amount of ILP available may vary widely across applications. Even the characteristics of a single program may vary during different phases of its execution [19]. Selecting the number of cores and their size at design time will result in inefficiencies when the characteristics of the workload do not match the fixed parameters of the system. An alternative to SCMPs is the asymmetric chip multiprocessor (ACMP), which typically comprises multiple processors of different sizes and granularities.
Such a design allows individual applications or application phases to be mapped to the processor size best suited to them, resulting in better power efficiency, greater throughput, and better area efficiency than SCMPs. However, the composition of an ACMP must still be determined at design time, leaving it vulnerable to mismatches between the workload and the system.

A recently proposed alternative to static-core CMPs is a family of flexible-core chip multiprocessors (FCMPs) in which the number and granularity of the processors are determined at runtime through aggregation and configuration [12, 14, 20]. Such designs typically comprise small to moderately sized uniprocessor cores, which can execute in parallel as a multitasking/parallel system or can be aggregated to form fewer but more powerful uniprocessor cores. The aggregation typically produces a core with higher issue width, a larger instruction window, and more level-1 instruction and data cache capacity. The flexibility of FCMPs provides the opportunity to tailor the hardware to the requirements of the tasks running on the system, or to co-optimize the software and the configuration of the underlying hardware. FCMPs offer a number of advantages over ACMPs, including the opportunity to map a wider range of workloads, simpler hardware implementation as all of the cores of an FCMP can be identical [12], and better tolerance to performance asymmetries resulting from the fixed but varying