REDfISh — REsilient Dynamic dIstributed Scalable System Services for Exascale

Hugh Greenberg, Michael Lang, Latchesar Ionkov and Sean Blanchard
Los Alamos National Laboratory
{hng,mlang,lionkov,seanb}@lanl.gov

Abstract—Supercomputers are continually advancing in order to solve some of the most challenging scientific problems. The petaflop (10^15 floating point operations per second) performance milestone has been reached, and researchers are now challenged with advancing the number of floating point operations to 10^18 per second, also known as exascale. Exascale-class systems are expected to contain millions of nodes consisting of low-powered processor cores connected through multiple interconnects. System software used today was never designed to scale to such systems; therefore, a dramatic change is needed for system services to address the challenges of exascale. Services need to be resilient, dynamic, distributed, and scalable in order to scale to this type of system. To address these requirements for future system services, we describe a novel path to creating exascale-ready services by focusing on the key tenets of resilience, dynamic adaptation, fully distributed processes, and scalability. We then present a DHCP (Dynamic Host Configuration Protocol) replacement based on this design and compare it to an existing DHCP implementation. We show that the dynamic allocation of services and the ability to absorb errors make our approach superior to standard services.

I. INTRODUCTION

With the introduction of Los Alamos National Laboratory's RoadRunner supercomputer in 2008 [1], the petascale (10^15 floating point operations per second) milestone was reached. Researchers are now looking towards the next stage in supercomputer evolution, known as exascale (10^18 floating point operations per second). In order to reach this milestone, high performance computing (HPC) system hardware will need to change significantly over the next decade.
Technology trends indicate that exascale systems will be composed of hundreds of millions to billions of heterogeneous cores; each core will have a limited amount of local memory and memory bandwidth, and each processor socket will contain a large number of cores [2]. Current HPC system designs focus a great deal of effort on optimizing interconnects, message passing libraries, and I/O networks; however, they suffer from brute-force hardware and software solutions for their infrastructure and management. Petascale tools will not scale efficiently to support exascale-class machines. Many exascale efforts are focusing on applications, algorithms, programming models, and hardware. System services are a less obvious but important issue for exascale.

The scientific computing community has identified many challenges to reaching exascale, including resilience, dynamic adaptation, distributed systems of services, scalability, and power [3]. These challenges require revolutionary changes in system software for success at exascale.

The resilience of system software is expected to be one of the major limiting factors in reaching exascale: since these systems are expected to consist of millions of nodes, failures will be commonplace. Even current petascale designs suffer from single points of failure, with at most a single fail-over server. The system itself needs to be able to diagnose issues and perform basic troubleshooting autonomically before requesting human intervention. By planning for failures and leveraging proven solutions from the large-scale Internet community, we have designed resilience into our system. We have built automatic recovery and mirroring mechanisms into the system such that it would require numerous simultaneous failures across an entire cluster to cause the loss of service to a section of the cluster. In such an eventuality, our software guarantees that the remaining services will be in a consistent state.
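The mirroring-with-failover idea above can be illustrated with a minimal sketch. This is not the paper's actual implementation; the class and method names (Replica, MirroredService) are hypothetical, and real mirroring would involve replicated state machines and network RPCs rather than in-process objects. The sketch only shows the key property claimed: a client of a mirrored service absorbs a replica failure transparently.

```python
# Hypothetical sketch: a client holds a list of mirrored replicas of a
# service and fails over to the next one when a replica is down.

class ReplicaDown(Exception):
    """Raised when a failed replica is contacted."""

class Replica:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}          # mirrored state (e.g., node -> address)

    def get(self, key):
        if not self.alive:
            raise ReplicaDown(self.name)
        return self.store[key]

class MirroredService:
    """Client-side view of a mirrored service: writes go to every live
    replica; reads fail over until a live replica answers."""
    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        for r in self.replicas:
            if r.alive:
                r.store[key] = value

    def get(self, key):
        for r in self.replicas:
            try:
                return r.get(key)
            except ReplicaDown:
                continue         # absorb the failure, try the next mirror
        raise RuntimeError("all mirrors failed")

# Usage: the service keeps answering even after one mirror dies.
a, b = Replica("svc-a"), Replica("svc-b")
svc = MirroredService([a, b])
svc.put("node17", "10.0.0.17")
a.alive = False                  # simulate a node failure
print(svc.get("node17"))         # served by the surviving mirror
```

Only when every mirror in the list has failed does the client see an error, which matches the design goal that numerous simultaneous failures are required to lose service.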
Dynamic response to system load will be key to power savings. Existing petascale systems scale with hardwired, brute-force techniques. Specially configured nodes within the system are dedicated to services such as monitoring, booting, resource management, I/O forwarding, and job launch. These service nodes must, by design, be overprovisioned to handle the heaviest imaginable load, but they sit idle while consuming power when not needed. Instead of dedicating overprovisioned nodes to run all of the system services and infrastructure monitoring, we envision a distributed network of services running on "spare cores" that would otherwise be idle in high-core-count nodes. Cores would be recruited dynamically on demand and, as demand subsides, returned to availability for other services or as compute cores. Our services would be tuned to use less memory bandwidth and to minimize interference with application performance [4]. Cores running system services could also be clocked at a lower frequency to save power, as system services are usually not CPU intensive. Also, any core can be used to support a service rather than only a single dedicated node, which aids resilience.

Collecting data from each node in an exascale system would result in an unwieldy amount of data. Thus, a single management node will not be able to gather and store data from the entire system in a reasonable amount of time or memory. A more manageable approach is to partition the system into regions. Regions will have a manager service