A Process Management Runtime with Dynamic Reconfiguration

Shinji Sumimoto, sumimoto.shinji@jp.fujitsu.com, Fujitsu Ltd., Japan
Toshihiro Hanawa, hanawa@cc.u-tokyo.ac.jp, The University of Tokyo, Japan
Kengo Nakajima, nakajima@cc.u-tokyo.ac.jp, The University of Tokyo / RIKEN Center for Computational Science, Japan

ABSTRACT
This paper proposes DyProReconf, a system runtime that can dynamically change the number of processes. By coordinating the system software with the DyProReconf runtime during system operation, it is possible to flexibly change the system configuration according to the amount of power used, and to execute priority jobs even when urgent computing is executed. DyProReconf allows users to dynamically modify a large number of processes from external input by using user-level checkpoint/restart programs and ULFM (User Level Failure Mitigation) for user process failure. We implemented DyProReconf with a fault injection mechanism using ULFM-enabled Open MPI and applied it to pHEAT-3D, an application for 3D unsteady-state heat transfer problems that uses the finite element method (FEM) with iterative linear solvers. The evaluation results show that DyProReconf is easily applied to pHEAT-3D, and that U-pHEAT-3D (pHEAT-3D with DyProReconf) can dynamically change the number of processes and continue the calculation after injected process failures.

KEYWORDS
Dynamic Reconfigurable Computer Center, Urgent Computing, Checkpoint/Restart, ULFM

ACM Reference Format:
Shinji Sumimoto, Toshihiro Hanawa, and Kengo Nakajima. 2022. A Process Management Runtime with Dynamic Reconfiguration. In International Conference on High Performance Computing in Asia-Pacific Region Workshops (HPCAsia 2022 Workshop), January 11–14, 2022, Virtual Event, Japan. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3503470.3503473

© 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9564-9/22/01. . . $15.00

1 INTRODUCTION

Post-exascale systems will be much larger than existing systems, because semiconductor performance scaling and increases in CPU operating clock frequency will be limited. As the size of a system increases, it becomes less reliable and consumes more power. If a system becomes less reliable, applications are more likely to go down, so fault tolerance should be introduced into both the application and the system software. Moreover, to save power as power consumption increases, the system software must dynamically change the system utilization rate. Because resource usage is calculated dynamically every hour to scale the operation of the system, the application itself needs a mechanism that can dynamically change its configuration.

However, it is not easy to change the system configuration dynamically, because most running applications and operational software do not support dynamic system changes. Therefore, to change the system configuration dynamically, it is necessary to interrupt or stop the application from time to time.
In the future, applications and system software should support dynamic system reconfiguration.

We are currently developing applications and system software for urgent computing [8, 12]. Urgent computing uses computer simulation to predict damage and plan evacuation in the event of natural disasters such as earthquakes, floods, and tsunamis. The urgent computing we envision is based on the premise of system operation that can dynamically control the amount of computer resources used, with the computer resources of the system coordinated by the application and system software in the same way as the amount of power used is controlled. We plan to control resource usage and allocate the necessary computer resources for urgent computing.

The technical challenge for urgent computing at the next-generation computer center that we are planning is to realize dynamic computer resource control by coordinating applications and system software. At the next-generation computer center, we can dynamically control changes in the operation scale according to the power supply and, at the same time, respond to urgent job execution such as urgent computing by dynamically migrating computer resources.

This paper describes DyProReconf (Dynamic Process Reconfigurable Runtime), a runtime system that can dynamically change the number of MPI processes. We first describe the computer center operation we are planning and the system operation for urgent computing realized on that system. We then describe the design, implementation, and evaluation of DyProReconf.

2 NEXT GENERATION COMPUTER CENTER OPERATION OVERVIEW AND URGENT COMPUTING AT THE CENTER

This section describes the requirements of the next-generation computer center that we consider, and describes how urgent computing is implemented at this computer center.