Expanse: Computing without Boundaries
Architecture, Deployment, and Early Operations Experiences of a Supercomputer Designed for the
Rapid Evolution in Science and Engineering
Ilkay Altintas, Haisong Cai, Trevor Cooper, Christopher Irving, Thomas Hutton, Marty Kandes,
Amitava Majumdar, Dmitry Mishin, Ismael Perez, Wayne Pfeiffer, Manu Shantharam, Robert S.
Sinkovits, Subhashini Sivagnanam, Shawn Strande, Mahidhar Tatineni, Mary Thomas, Nicole
Wolter, Michael Norman
University of California, San Diego, La Jolla, CA
ABSTRACT
We describe the design motivation, architecture, deployment, and
early operations of Expanse, a 5 Petaflop, heterogeneous HPC system
that entered production as an NSF-funded resource in December
2020 and will be operated on behalf of the national community
for five years. Expanse will serve a broad range of computational
science and engineering through a combination of standard batch-
oriented services, and by extending the system to the broader CI
ecosystem through science gateways, public cloud integration, sup-
port for high throughput computing, and composable systems. Ex-
panse was procured, deployed, and put into production entirely
during the COVID-19 pandemic, adhering to stringent public health
guidelines throughout. Nevertheless, the planned production date
of October 1, 2020 slipped by only two months, thanks to thorough
planning, a dedicated team of technical and administrative experts,
collaborative vendor partnerships, and a commitment to getting an
important national computing resource to the community at a time
of great need.
CCS CONCEPTS
• Computer systems organization → Architectures → Distributed architectures;
KEYWORDS
High performance computing, high throughput computing, science
gateways, scientific applications, user support
ACM Reference Format:
Ilkay Altintas, Haisong Cai, Trevor Cooper, Christopher Irving, Thomas Hutton,
Marty Kandes, Amitava Majumdar, Dmitry Mishin, Ismael Perez, Wayne Pfeiffer,
Manu Shantharam, Robert S. Sinkovits, Subhashini Sivagnanam, Shawn Strande,
Mahidhar Tatineni, Mary Thomas, Nicole Wolter, Michael Norman. 2021. Expanse:
Computing without Boundaries: Architecture, Deployment, and Early Operations
Experiences of a Supercomputer Designed for the Rapid Evolution in Science and
Engineering. In Practice and Experience in Advanced Research Computing
(PEARC '21), July 18–22, 2021, Boston, MA, USA. ACM, New York, NY, USA,
4 pages. https://doi.org/10.1145/3437359.3465588

This work is licensed under a Creative Commons Attribution International
4.0 License.
PEARC '21, July 18–22, 2021, Boston, MA, USA
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8292-2/21/07.
https://doi.org/10.1145/3437359.3465588
1 SYSTEM ARCHITECTURE
High-performance computing system architectures, software, and
expertise are evolving to support the growing diversification in
computing that is being driven by rapidly evolving science and
engineering research. Now more than ever, supercomputers must
be part of a more integrated, national cyberinfrastructure that comprises
distributed computing and data resources, scientific instruments,
R&E networks, and expertise. Accordingly, systems and
application software, and user support must also evolve to support
this expanding ecosystem [1, 2].
Developed in response to NSF Solicitation 19-534, Expanse is an
evolutionary system, designed in large part from lessons learned
from the operation of Comet [3, 4], a supercomputer that has been
operated by SDSC for the last six years. Like Comet, Expanse was
designed to support the "long tail" of computing, which we define
as the broad spectrum of computational science and engineering
research that is carried out at modest scale, but with increasingly
diverse system and application software. A summary of the major
subsystems is given in Table 1.
Notable features of this design include: first large-scale NSF
system to feature AMD EPYC processors; 13 identical Scalable Units,
each with 56 compute and 4 GPU nodes; rich storage environment;
full-bisection, low-latency Mellanox HDR-100 interconnect at the
rack level, accessing 7,168 EPYC cores, and 16 V100 GPUs; support
for Slurm and Kubernetes; integration with the Open Science Grid;
scheduler-based integration with public cloud; and support for
composable systems. The system includes a Lustre file system for
compute and a Ceph file system built from hardware repurposed from
Comet, which will be retired in July 2021. The Ceph system will
support the composable systems and cloud integration capability, as
well as a limited second copy of data.
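The Scalable Unit figures above can be sanity-checked with a little arithmetic. This sketch assumes 128 EPYC cores per compute node and four V100s per GPU node (per-node counts not stated in this excerpt, only implied by the 7,168-core and 16-GPU rack-level totals); the SU composition of 56 compute and 4 GPU nodes, and the count of 13 SUs, come from the text.

```python
# Back-of-the-envelope check of Expanse's Scalable Unit (SU) design.
# ASSUMED per-node counts (not stated explicitly in the text):
CORES_PER_COMPUTE_NODE = 128   # e.g., dual-socket 64-core AMD EPYC
GPUS_PER_GPU_NODE = 4          # e.g., 4x NVIDIA V100 per GPU node

# SU composition and system scale, from the text:
COMPUTE_NODES_PER_SU = 56
GPU_NODES_PER_SU = 4
NUM_SCALABLE_UNITS = 13

def su_totals():
    """Return (cores, gpus) reachable within one full-bisection SU."""
    cores = COMPUTE_NODES_PER_SU * CORES_PER_COMPUTE_NODE
    gpus = GPU_NODES_PER_SU * GPUS_PER_GPU_NODE
    return cores, gpus

cores, gpus = su_totals()
print(cores, gpus)                 # 7168 16, matching the rack-level totals
print(NUM_SCALABLE_UNITS * cores)  # 93184 CPU cores system-wide
```

Under these assumptions one SU yields exactly the 7,168 cores and 16 GPUs quoted for the rack-level full-bisection domain, and 13 identical SUs give the system-wide CPU core count.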
2 ACQUISITION AND DEPLOYMENT
2.1 Project Changes
Following a thorough assessment of the original design in light
of vendor developments following the award and the potential
impact on the planned deployment, a decision was made to execute
a Project Change Request (PCR) for the standard compute node
processor (the Expanse Risk Mitigation Plan had a provision for