A Process Management Runtime with Dynamic Reconfiguration
Shinji Sumimoto
sumimoto.shinji@jp.fujitsu.com
Fujitsu Ltd.
Japan
Toshihiro Hanawa
hanawa@cc.u-tokyo.ac.jp
The University of Tokyo
Japan
Kengo Nakajima
nakajima@cc.u-tokyo.ac.jp
The University of Tokyo / RIKEN
Center for Computational Science
Japan
ABSTRACT
This paper proposes DyProReconf, a runtime system that can
dynamically change the number of processes of a running application.
By coordinating the system software with the DyProReconf runtime
during system operation, the system configuration can be changed
flexibly according to the amount of power used, and priority jobs can
be executed when urgent computing is invoked. DyProReconf allows
users to dynamically reconfigure a large number of processes in
response to external input, using user-level checkpoint/restart and
ULFM (User Level Failure Mitigation) to handle user process failures.
We implemented DyProReconf, together with a fault injection
mechanism, on ULFM-enabled Open MPI and applied it to pHEAT-3D, an
application that solves 3D unsteady-state heat transfer problems by
the finite element method (FEM) using iterative linear solvers. The
evaluation shows that DyProReconf is easily applied to pHEAT-3D, and
that U-pHEAT-3D (pHEAT-3D with DyProReconf) can dynamically change
the number of processes and continue the calculation despite injected
process failures.
KEYWORDS
Dynamic Reconfigurable Computer Center, Urgent Computing,
Checkpoint/Restart, ULFM
ACM Reference Format:
Shinji Sumimoto, Toshihiro Hanawa, and Kengo Nakajima. 2022. A Process
Management Runtime with Dynamic Reconfiguration. In International
Conference on High Performance Computing in Asia-Pacific Region Workshops
(HPCAsia 2022 Workshop), January 11-14, 2022, Virtual Event, Japan. ACM,
New York, NY, USA, 9 pages. https://doi.org/10.1145/3503470.3503473
1 INTRODUCTION
Post-exascale systems will be much larger in scale than existing
systems, because semiconductor performance scaling and increases in
CPU clock frequency will be limited.
As the size of a system increases, it becomes less reliable and
consumes more power. As reliability falls, applications are more
likely to fail, so fault tolerance should be built into both
applications and system software. Moreover, to save power as power
consumption grows, the system software must dynamically change the
system utilization rate. Because resource usage is recalculated
hourly to scale the operation of the system, the application itself
needs a mechanism that can dynamically change its configuration.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
HPCAsia 2022 Workshop, January 11-14, 2022, Virtual Event, Japan
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9564-9/22/01...$15.00
https://doi.org/10.1145/3503470.3503473
However, it is not easy to change the system configuration
dynamically, because most running applications and operational
software do not support dynamic system changes. Changing the system
configuration therefore currently requires interrupting or stopping
applications from time to time. In the future, applications and
system software should support dynamic system reconfiguration.
We are currently developing applications and system software
for urgent computing [8, 12]. Urgent computing uses computer
simulation to predict damage and plan evacuation in the event of
natural disasters such as earthquakes, floods, and tsunamis.
The urgent computing we envision is based on the premise of system
operation that can dynamically control the amount of computer
resources used. The computer resources of the system are coordinated
by the application and system software in the same way as the amount
of power used is controlled. We plan to control resource usage in
this way and allocate the computer resources that urgent computing
requires.
The technical challenge for urgent computing at the next-generation
computer center that we are planning is to realize dynamic computer
resource control by coordinating applications and system software.
At this center, the operation scale can be changed dynamically
according to the power supply and, at the same time, urgent jobs
such as urgent computing can be served by dynamically migrating
computer resources.
This paper describes DyProReconf (Dynamic Process Reconfigurable
Runtime), a runtime system that can dynamically change the number of
MPI processes. We first describe the computer center operation we
are planning and the system operation, assuming urgent computing,
that is realized on that system. We then describe the design,
implementation, and evaluation of DyProReconf.
2 NEXT GENERATION COMPUTER CENTER
OPERATION OVERVIEW AND URGENT
COMPUTING AT THE CENTER
This section describes the requirements of the next-generation
computer center that we envision and how urgent computing is
implemented at this center.