MEMS-based Runtime Idle Energy Minimization for Bursty Workloads in Heterogeneous Many-Core Systems

Ali Aalsaud 1,2, Haider Alrudainy 3, Rishad Shafik 1, Fei Xia 1 and Alex Yakovlev 1
1 School of EEE, University of Newcastle, Newcastle upon Tyne, NE1 7RU, England, UK
2 School of Engineering, Al-Mustansiriya University, Baghdad, Iraq
3 Basra Engineering Technical College, Southern Technical University, Iraq
Email 1: {A.m.m.aalsaud, Rishad.Shafik, Fei.Xia, Alex.yakovlev}@ncl.ac.uk. Email 3: h.m.a.alrudainy@stu.edu.iq

Abstract—Heterogeneous many-core systems are increasingly being employed in modern embedded applications to achieve high throughput at low energy cost. These applications exhibit bursty workloads that provide opportunities to minimize system energy. Traditionally, CMOS-based power gating circuitry, consisting of sleep transistors, is used for idle energy reduction in such applications. However, these transistors contribute high leakage current when driving large capacitive loads, making effective energy minimization challenging. In this paper, we propose a novel MEMS-based runtime energy minimization approach. Core to our approach is integrated sleep mode management based on the performance-energy states and the bursty workloads indicated by the performance counters. For effective energy minimization, we systematically optimize the controller design parameters using finite element analysis (FEA) in the COMSOL Multiphysics tool. A number of PARSEC benchmark applications are used as case studies of bursty workloads, including CPU- and memory-intensive ones. These applications are exercised on an Exynos 5422 heterogeneous many-core platform, showing up to 50% energy savings compared with the ondemand governor.
Furthermore, we provide an extensive trade-off analysis to demonstrate the comparative advantages of the MEMS-based controller, including zero leakage current and a non-invasive implementation suitable for commercial off-the-shelf systems.

I. INTRODUCTION

The impetus for high throughput at low energy cost is at the core of the design and implementation of many-core embedded systems. An effective technique for managing the trade-off between throughput and energy is to allocate heterogeneous computing resources on these systems. The Exynos 5422 big.LITTLE octa-core platform, which includes 4 big (ARM A15) and 4 LITTLE (ARM A7) cores, is a typical example [1].

Over the years, significant research has been carried out to address energy minimization in heterogeneous embedded systems [2]. Such works typically control the core allocation, coupled with dynamic voltage/frequency scaling (DVFS) decisions, to react to workload variations [3]. When a higher workload is encountered, more cores are allocated with suitably determined DVFS settings. Conversely, when the workload is lower, fewer cores are used at reduced voltage/frequency levels. These allocations are managed by a runtime system that interacts with the application for workload-based optimizations.

Figure 1: Experimental measurements of idle power on the Odroid-XU3 big.LITTLE platform, plotted as power (watts) against the number of idle cores for big and LITTLE cores: (a) 1400 MHz big.LITTLE; (b) 2000 MHz big, 1400 MHz LITTLE.

From a core-level viewpoint, continuous runtime control renders workloads bursty, that is, characterized by frequent switching between periods of high activity and periods of no activity. The periods of inactivity lead to idle energy consumption because the clock and supply voltage remain operational.
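To make the cost of these inactive periods concrete, the following minimal Python sketch estimates idle energy for a bursty trace as idle power multiplied by total idle time. The function name `idle_energy_joules`, the trace, and the power value are illustrative assumptions, not measurements or code from this work.

```python
def idle_energy_joules(bursts, idle_power_w):
    """Energy (J) spent while idle, given (active_s, idle_s) pairs per burst."""
    # Only the idle part of each burst contributes to idle energy.
    idle_time_s = sum(idle_s for _active_s, idle_s in bursts)
    return idle_power_w * idle_time_s


# Hypothetical bursty trace: (active seconds, idle seconds) for each burst.
trace = [(0.2, 0.8), (0.5, 1.5), (0.1, 0.7)]

# Assumed idle power of the inactive cores in watts (illustrative value only).
energy_j = idle_energy_joules(trace, idle_power_w=0.5)
print(f"Estimated idle energy: {energy_j:.2f} J")
```

Under this simple model, any mechanism that lowers the idle power term during inactive intervals reduces the wasted energy in direct proportion.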
Figure 1 depicts the idle power measurements on the Odroid-XU3 big.LITTLE platform for different core allocations and frequencies. Two observations can be made. Firstly, the idle power consumption increases with the number of inactive cores (big or LITTLE). As an example, the idle power of 4 inactive big cores at 2000 MHz is 1 W, which drops to 0.8 W when only 1 big core is inactive. Secondly, the idle power also depends on the operating frequency. For instance, when parallel threads are allocated to LITTLE cores only, the idle power dissipation of 4 inactive big cores rises from 0.39 W at 1400 MHz to 1 W at 2000 MHz [4]. Idle power contributes to wasted energy consumption, reducing battery lifetime.

To reduce the idle power, the traditional approach is power gating. The basic principle is to use a number of sleep transistors to disconnect the supply voltage rail, shutting down the inactive cores. Table I summarizes the contributions of existing power gating approaches. A hardware-based stateless load-balancing scheme for homogeneous multi-core systems is evaluated in terms of power and thermal behaviour in [5]. In this approach, power reduction is achieved by switching off the idle cores. In [6], a sub-clock power gating technique is proposed to reduce static power during the sub-clock cycle of the ARM Cortex-M0. This technique requires an intrusive redesign of the power gating paradigm. Among others, Charles et al. [7] implemented per core