Towards Exascale Computing for High Energy Physics: The ATLAS Experience at ORNL

V. Ananthraj, K. De, S. Jha‡¶, A. Klimentov§, D. Oleynik, S. Oral, A. Merzky, R. Mashinistov§, S. Panitkin§, P. Svirin§, M. Turilli, J. Wells, S. Wilkinson

OLCF, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Department of Physics, University of Texas, Arlington, TX, USA
‡ Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ, USA
§ Department of Physics, Brookhaven National Laboratory, Upton, NY, USA
¶ Computational Science Initiative, Brookhaven National Laboratory, Upton, NY, USA

Abstract—Traditionally, the ATLAS experiment at the Large Hadron Collider (LHC) has utilized distributed resources provided by the Worldwide LHC Computing Grid (WLCG) to support data distribution, data analysis, and simulations. For example, the ATLAS experiment uses a geographically distributed grid of approximately 200,000 cores continuously (250,000 cores at peak), amounting to over 1,000 million core-hours per year, to process, simulate, and analyze its data (today's total ATLAS data volume exceeds 300 PB). After the early success in discovering a new particle consistent with the long-awaited Higgs boson, ATLAS is continuing the precision measurements necessary for further discoveries. The planned high-luminosity LHC upgrade and the related ATLAS detector upgrades, which are necessary for physics searches beyond the Standard Model, pose a serious challenge for ATLAS computing. Data volumes are expected to increase at higher energy and luminosity, causing the storage and computing needs to grow at a much higher pace than flat-budget technology evolution allows (see Fig. 1). The need for simulation and analysis will overwhelm the expected capacity of WLCG computing facilities unless the range and precision of physics studies are curtailed.

Fig. 1. Estimated ATLAS computing needs vs. expected increase in resources on a flat budget. Gray horizontal bars depict LHC data-taking periods.

To alleviate these challenges, over the past few years the ATLAS experiment has been investigating the implications of using high-performance computers, such as those found at the Oak Ridge Leadership Computing Facility (OLCF). This steady transition is a consequence of application requirements (e.g., greater than expected data production), technology trends, and software complexity. Specifically, the DOE ASCR- and HEP-funded BigPanDA project provided the first important demonstration of the impact a workload management system (WMS) can have on improving the uptake and utilization of supercomputers from both the application and the systems points of view. We quantify the impact of this sustained and steady uptake of supercomputer resources via BigPanDA: from January 2017 to September 2018, BigPanDA enabled the utilization of 400 million Titan core-hours, primarily via the backfill mechanism (275 million core-hours), but also through regular front-end submission as part of the ALCC (ASCR Leadership Computing Challenge) project (125 million core-hours). Fig. 2 shows the monthly core-hour consumption on Titan by ATLAS jobs. This non-trivial amount of 400 million Titan core-hours has resulted in 920 million simulated events delivered to ATLAS physicists for analysis.

Fig. 2. ATLAS CPU consumption on Titan from January 2016 to August 2018. Blue bars correspond to resource consumption via backfill with no defined allocation; red bars, to consumption using the ALCC allocation.
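The backfill mechanism exploits scheduling gaps: whenever Titan's batch scheduler holds nodes idle while assembling a large reserved job, a pilot can claim those nodes for a payload short enough to finish before the reservation starts. The following minimal sketch illustrates the idea, assuming a Moab scheduler (whose showbf command reports currently idle resources); the output format parsed here and the sizing parameters are simplified assumptions, not the production BigPanDA logic.

    #!/usr/bin/env python3
    # Hedged sketch: backfill-aware job shaping in the spirit of BigPanDA
    # on Titan. Assumes a Moab scheduler; `showbf` is Moab's command for
    # reporting idle resources, but the output format parsed below is a
    # simplified assumption, not Moab's literal format.

    import subprocess

    def query_backfill_window():
        """Return (free_nodes, window_seconds) from the scheduler,
        or (0, 0) if no backfill window can be determined."""
        try:
            out = subprocess.run(["showbf"], capture_output=True,
                                 text=True, check=True).stdout
        except (OSError, subprocess.CalledProcessError):
            return 0, 0
        # Assumed one-line format "partition nodes seconds",
        # e.g. "batch 1500 7200". Real showbf output is richer.
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
                return int(fields[1]), int(fields[2])
        return 0, 0

    def shape_pilot_job(free_nodes, window_seconds, min_nodes=16, margin=300):
        """Size a pilot so it fits inside the backfill window, keeping a
        safety margin so it finishes before the reserved job starts."""
        if free_nodes < min_nodes or window_seconds <= margin:
            return None  # window too small: poll again later
        return {"nodes": free_nodes, "walltime_s": window_seconds - margin}

    if __name__ == "__main__":
        nodes, window = query_backfill_window()
        print(shape_pilot_job(nodes, window))

Because such a pilot asks for resources the scheduler would otherwise leave idle, it consumes core-hours without competing against allocated projects, which is what made the 275 million backfill core-hours possible without a dedicated award.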
Titan now provides 3-5% of all ATLAS compute resources, and other DOE supercomputers also provide significant compute allocations. In spite of these impressive numbers, the uptake and utilization of current and future supercomputing resources must be further improved in order to meet the ATLAS computing challenges of the high-luminosity regime. In this short paper, we outline how we have steadily made the ATLAS project ready for the exascale era. Our approach to the exascale involves the BigPanDA workload management system. BigPanDA is responsible for the coordination of tasks, the orchestration of resources, and job submission.
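To make these responsibilities concrete, the sketch below illustrates the pilot-based "pull" pattern on which PanDA-style systems are built: a pilot running on the compute resource requests a payload from the central server only once resources are actually in hand (late binding), executes it, and reports the outcome. The server URL, endpoint names, site label, and JSON fields are hypothetical placeholders rather than the actual PanDA protocol.

    #!/usr/bin/env python3
    # Hedged sketch of the pilot "pull" pattern used by PanDA-style
    # workload management systems. The server URL, endpoints, site name,
    # and JSON fields below are hypothetical, not the real PanDA protocol.

    import subprocess
    import requests  # third-party; pip install requests

    SERVER = "https://panda.example.org"   # hypothetical WMS endpoint
    SITE = "ORNL_Titan_MCORE"              # illustrative site label

    def fetch_payload():
        """Late binding: ask the server for work only once this pilot
        has secured resources on the machine."""
        r = requests.get(f"{SERVER}/getJob", params={"site": SITE},
                         timeout=30)
        r.raise_for_status()
        return r.json()  # assumed shape: {"id": ..., "command": ...}

    def run_and_report(job):
        """Execute the payload and report its final state to the server."""
        rc = subprocess.run(job["command"], shell=True).returncode
        state = "finished" if rc == 0 else "failed"
        requests.post(f"{SERVER}/updateJob",
                      data={"jobId": job["id"], "state": state},
                      timeout=30)

    if __name__ == "__main__":
        run_and_report(fetch_payload())

The value of this design on a leadership-class machine is the decoupling it provides: the central server coordinates tasks globally, while the pilot handles the resource-specific details of when and how work can actually run on Titan.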