Expanse: Computing without Boundaries
Architecture, Deployment, and Early Operations Experiences of a Supercomputer Designed for the Rapid Evolution in Science and Engineering

Ilkay Altintas, Haisong Cai, Trevor Cooper, Christopher Irving, Thomas Hutton, Marty Kandes, Amitava Majumdar, Dmitry Mishin, Ismael Perez, Wayne Pfeiffer, Manu Shantharam, Robert S. Sinkovits, Subhashini Sivagnanam, Shawn Strande, Mahidhar Tatineni, Mary Thomas, Nicole Wolter, Michael Norman
University of California, San Diego, La Jolla, CA

ABSTRACT
We describe the design motivation, architecture, deployment, and early operations of Expanse, a 5-Petaflop, heterogeneous HPC system that entered production as an NSF-funded resource in December 2020 and will be operated on behalf of the national community for five years. Expanse will serve a broad range of computational science and engineering through a combination of standard batch-oriented services, and by extending the system to the broader CI ecosystem through science gateways, public cloud integration, support for high-throughput computing, and composable systems. Expanse was procured, deployed, and put into production entirely during the COVID-19 pandemic, adhering to stringent public health guidelines throughout. Nevertheless, the planned production date of October 1, 2020 slipped by only two months, thanks to thorough planning, a dedicated team of technical and administrative experts, collaborative vendor partnerships, and a commitment to delivering an important national computing resource to the community at a time of great need.
CCS CONCEPTS
• Computer Systems Organization → Architectures → Distributed Architectures

KEYWORDS
High performance computing, high throughput computing, science gateways, scientific applications, user support

ACM Reference Format:
Ilkay Altintas, Haisong Cai, Trevor Cooper, Christopher Irving, Thomas Hutton, Marty Kandes, Amitava Majumdar, Dmitry Mishin, Ismael Perez, Wayne Pfeiffer, Manu Shantharam, Robert S. Sinkovits, Subhashini Sivagnanam, Shawn Strande, Mahidhar Tatineni, Mary Thomas, Nicole Wolter, Michael Norman. 2021. Expanse: Computing without Boundaries: Architecture, Deployment, and Early Operations Experiences of a Supercomputer Designed for the Rapid Evolution in Science and Engineering. In Practice and Experience in Advanced Research Computing (PEARC '21), July 18–22, 2021, Boston, MA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3437359.3465588

This work is licensed under a Creative Commons Attribution International 4.0 License. PEARC '21, July 18–22, 2021, Boston, MA, USA. © 2021 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-8292-2/21/07.

1 SYSTEM ARCHITECTURE
High-performance computing system architectures, software, and expertise are evolving to support the growing diversification in computing that is being driven by rapidly evolving science and engineering research. Now more than ever, supercomputers must be part of a more integrated, national cyberinfrastructure that comprises distributed computing and data resources, scientific instruments, R&E networks, and expertise. Accordingly, system and application software, and user support, must also evolve to support this expanding ecosystem [1, 2]. Developed in response to NSF Solicitation 19-534, Expanse is an evolutionary system, designed in large part from lessons learned from the operation of Comet [3, 4], a supercomputer that has been operated by SDSC for the last six years.
Like Comet, Expanse was designed to support the "long tail" of computing, which we define as the broad spectrum of computational science and engineering research that is carried out at modest scale, but with increasingly diverse system and application software. A summary of the major subsystems is given in Table 1. Notable features of this design include: the first large-scale NSF system to feature AMD EPYC processors; 13 identical Scalable Units, each with 56 compute nodes and 4 GPU nodes; a rich storage environment; a full-bisection, low-latency Mellanox HDR-100 interconnect at the rack level, accessing 7,168 EPYC cores and 16 V100 GPUs; support for Slurm and Kubernetes; integration with the Open Science Grid; scheduler-based integration with the public cloud; and support for composable systems. The system includes a Lustre file system for compute and a Ceph file system built from hardware repurposed from the Comet system, which will be retired in July 2021; the Ceph system will support the composable systems and cloud integration capabilities, as well as a limited second copy of data.

2 ACQUISITION AND DEPLOYMENT
2.1 Project Changes
Following a thorough assessment of the original design, in light of vendor developments following the award and their potential impact on the planned deployment, a decision was made to execute a Project Change Request (PCR) for the standard compute node processor (the Expanse Risk Mitigation Plan had a provision for