Towards optimal scheduling policy for heterogeneous memory architecture in many-core system

Geunchul Park 1 · Seungwoo Rho 1 · Jik-Soo Kim 2 · Dukyun Nam 1

Received: 26 January 2018 / Revised: 19 June 2018 / Accepted: 17 July 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
With the advent of Intel's second-generation many-core processor (Knights Landing: KNL), high-bandwidth memory (HBM) with potentially five times more bandwidth than existing dynamic random-access memory has become available as a valuable computing resource for high-performance computing (HPC) applications. Therefore, resource management schemes should now be able to consider existing central processing unit cores, conventional main memory, and this newly available HBM to improve overall system throughput and user response time. In this paper, we present our profiling mechanism and a related scheduling policy that analyzes the resource usage patterns of various HPC workloads. By carefully allocating memory-intensive workloads to HBM in KNL, we show that the overall performance of multiple message passing interface workloads can be improved in terms of execution time and system utilization. We evaluate and verify the effectiveness of our scheme for optimizing the use of HBM using the NAS Parallel Benchmarks.

Keywords Many-core · High-bandwidth memory (HBM) · Scheduling · Parallel program

1 Introduction

Advances in technology have contributed to the availability of high-performance computing (HPC) for a variety of applications. In addition, parallel processing technology has been developed to meet the ever-growing demands of HPC. Specifically, many-core processing has emerged as a means of avoiding the problems associated with a high central processing unit (CPU) clock speed due to the limitations of integrated circuits.
In recent years, the number of cores per processor has been rapidly increasing, and HPC systems have been implemented in various forms, e.g., accelerators (such as Intel Xeon Phi), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), to provide low-power and high-performance computing environments.

Such many-core processors are continuously evolving toward higher performance. For example, Knights Landing (KNL) [1], Intel's second-generation many-core processor, contains on-package high-bandwidth memory (HBM) called multichannel dynamic random-access memory (MCDRAM) [2, 3], whose bandwidth is five times greater than that of the existing main memory (DRAM). In contrast to Knights Corner, i.e., the first-generation many-core processor, which is provided only as an accelerator, KNL can be used as a self-hosted processor.

It is important to analyze the characteristics and workloads of these processors to achieve the best performance. We have little knowledge regarding the optimal usage of the new features of hybrid memory systems that consist of both on-package memory and existing DRAM, as the first versions of these systems have been introduced into the market only recently. For example, one method for such optimal usage is the determination of a scheduling policy that is executed efficiently on a specific processor.

Corresponding author: Dukyun Nam, dynam@kisti.re.kr
Geunchul Park, gcpark@kisti.re.kr
Seungwoo Rho, seungwoo0926@kisti.re.kr
Jik-Soo Kim, jiksoo@mju.ac.kr

1 National Institute of Supercomputing and Networking, KISTI, 245 Daehak-ro, Yuseong-gu, Daejeon 34141, Korea
2 Department of Computer Engineering, Myongji University, 116 Myongji-ro, Cheoin-gu, Yongin, Gyeonggi-do, Korea

Cluster Computing, https://doi.org/10.1007/s10586-018-2825-4