INDENT: Incremental Online Decision Tree Training for Domain-Specific Systems-on-Chip

Anish Krishnakumar, anish.n.krishnakumar@wisc.edu, University of Wisconsin-Madison, USA
Radu Marculescu, radum@utexas.edu, The University of Texas at Austin, USA
Umit Ogras, uogras@wisc.edu, University of Wisconsin-Madison, USA

ABSTRACT
The performance and energy-efficiency potential of heterogeneous architectures has fueled domain-specific systems-on-chip (DSSoCs) that integrate general-purpose cores and domain-specialized hardware accelerators. Decision trees (DTs) perform high-quality, low-latency task scheduling that effectively exploits the massive parallelism and heterogeneity in DSSoCs. However, offline-trained DT scheduling policies can quickly become ineffective when applications or hardware configurations change. Since current training approaches have large memory and computational power requirements, there is a critical need for runtime techniques that train DTs incrementally without sacrificing accuracy. To address this need, we propose INDENT, an incremental online DT framework that updates the scheduling policy and adapts it to unseen scenarios. INDENT updates DT schedulers at runtime using only 1-8% of the original training data, embedded during training. Thorough evaluations with hardware platforms and DSSoC simulators demonstrate that INDENT performs within 5% of a DT trained from scratch on the entire dataset and outperforms current state-of-the-art approaches.

CCS CONCEPTS
• Computer systems organization → System on a chip.

KEYWORDS
Domain-specific system-on-chip, online learning, incremental training, decision trees, task scheduling, resource management, low power, ultra-low latency.

ACM Reference Format:
Anish Krishnakumar, Radu Marculescu, and Umit Ogras. 2022. INDENT: Incremental Online Decision Tree Training for Domain-Specific Systems-on-Chip. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD '22), October 30-November 3, 2022, San Diego, CA, USA.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3508352.3549436

ICCAD '22, October 30-November 3, 2022, San Diego, CA, USA
© 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9217-4/22/10. . . $15.00
https://doi.org/10.1145/3508352.3549436

1 INTRODUCTION
With the slowdown of Moore's law and Dennard scaling, heterogeneous processing elements (PEs) have been the primary catalyst for the performance and energy efficiency of computing systems [38]. For example, highly optimized fixed-function hardware accelerators for signal processing and deep learning are commonly used in communication and autonomous-driving applications [3, 18]. However, these performance and energy-efficiency gains come at the expense of programming flexibility, as hardware accelerators are notoriously hard to program. To address this challenge, domain-specific systems-on-chip (DSSoCs) have risen as a new class of heterogeneous SoCs [1, 23]. They combine the flexibility of general-purpose cores with the performance and energy efficiency of specialized hardware accelerators tailored to applications in a target domain [4, 11, 14]. DSSoCs comprise many heterogeneous processing elements, resulting in an ample runtime decision space for task execution.
Hence, scheduling algorithms try to identify the most appropriate execution resource to maximize a specific optimization objective, such as performance, power consumption, or energy-delay product [18, 22, 26, 27, 37, 39].

DSSoCs can execute tasks in the order of nanoseconds thanks to highly specialized hardware accelerators. Hence, task scheduling algorithms must provide high-quality scheduling decisions at ultra-low latencies [10, 11]. Decision tree (in short, DT) classifiers offer a promising solution since they provide high-quality decisions at lower inference latency than multi-layer perceptrons and deep neural networks. Furthermore, DT policies are simple and easy to interpret [19, 21, 34].

Task scheduling policies designed offline are optimized for a particular optimization objective, SoC configuration, and set of applications [16, 18, 39]. Therefore, rapidly evolving SoC architectures, emerging applications, and workloads pose a severe risk to fixed scheduling policies. As these parameters change over time, the offline-designed static policies become ineffective, eroding the energy-efficiency potential of DSSoCs. Hence, there is a critical need for task scheduling policies that adapt to dynamic changes to maximize performance and energy efficiency.

Existing DT design techniques require the entire dataset to train a new standalone DT [5]. This requirement is a significant drawback compared to other ML models, such as neural networks, since storing all training samples would require substantial memory on the target platform. Hence, classical DT training algorithms are impractical for online adaptation. Prior studies tried to address this challenge using reinforcement learning (RL), ensemble trees, and very fast decision trees (e.g., Hoeffding trees) [9, 15, 25, 33]. RL techniques suffer from high computational power requirements for training [40]; DT ensembles incur higher latency and computational overheads due to their many weak learners [13]; finally, the assumptions required to train Hoeffding trees do not hold for online updates [25]. Hence, existing techniques are unsuitable for incremental and online DT updates given the resource constraints and inference-latency targets of DSSoCs.
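To make the latency argument concrete, the following sketch shows why DT inference fits ultra-low-latency scheduling: a trained tree reduces to a handful of comparisons. The feature names, thresholds, and PE labels below are hypothetical illustrations, not INDENT's actual learned policy.

```python
# Illustrative only: a hand-rolled, depth-limited decision tree that maps
# task features to a processing element (PE). All thresholds and PE names
# are hypothetical; the point is that inference is just a few branches.

def dt_schedule(task_type: int, data_bytes: int, queue_len: int) -> str:
    """Return the PE chosen for a task under a toy scheduling policy."""
    if task_type == 0:             # e.g., a transform-style task
        if data_bytes > 4096:
            return "fft_accel"     # large inputs -> specialized accelerator
        return "big_core"          # small inputs not worth offload overhead
    if queue_len < 2:              # accelerator queue is short enough
        return "matmul_accel"
    return "little_core"           # fall back to a general-purpose core

# Inference cost is bounded by tree depth (2-3 comparisons here), versus
# many multiply-accumulate operations for an MLP of comparable accuracy.
print(dt_schedule(0, 8192, 0))
```

A depth-d tree never evaluates more than d conditions per decision, which is why DTs can keep up with tasks that complete in nanoseconds on accelerators.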
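The memory obstacle above can also be made concrete. One generic way to retrain a model without storing the full dataset is to keep a small uniform subsample of past scheduling decisions, e.g., via reservoir sampling, and refit on it when the workload shifts. This is only a sketch of that general idea, loosely analogous to retaining 1-8% of the training data; it is not INDENT's actual embedding or update algorithm.

```python
import random

# Illustrative sketch (not INDENT's algorithm): maintain a bounded,
# uniformly sampled buffer of (features, PE label) pairs so a new tree
# can be refit at runtime without storing the entire training set.

class Reservoir:
    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.samples = []            # retained (features, label) pairs
        self.seen = 0                # total samples observed so far
        self.rng = random.Random(seed)

    def add(self, features, label) -> None:
        """Algorithm R: each observed sample survives with prob capacity/seen."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append((features, label))
        else:
            j = self.rng.randrange(self.seen)   # uniform in [0, seen)
            if j < self.capacity:
                self.samples[j] = (features, label)

# Retain 50 of 1000 streamed samples (5%); refitting a DT on this buffer
# stands in for an online policy update under a workload change.
res = Reservoir(capacity=50)
for i in range(1000):
    res.add([i % 7, i % 3], f"pe{i % 4}")
print(len(res.samples))
```

The buffer size directly trades memory for retraining fidelity, which mirrors the accuracy-versus-footprint tension the paragraph above describes for online DT updates.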