Job and Data Clustering for Aggregate Use of Multiple Production Cyberinfrastructures

Ketan Maheshwari, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, ketan@mcs.anl.gov
Allan Espinosa, Department of Computer Science, University of Chicago, Chicago, IL, USA, aespinosa@cs.uchicago.edu
Daniel S. Katz, Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, IL, USA, d.katz@ieee.org
Michael Wilde, Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, IL, USA, wilde@mcs.anl.gov
Zhao Zhang, Computation Institute, University of Chicago, Chicago, IL, USA, zhaozhang@uchicago.edu
Ian Foster, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA, foster@mcs.anl.gov
Scott Callaghan, Southern California Earthquake Center, University of Southern California, Los Angeles, CA, USA, scottcal@usc.edu
Phillip Maechling, Southern California Earthquake Center, University of Southern California, Los Angeles, CA, USA, maechlin@usc.edu

ABSTRACT

In this paper, we address the challenges of reducing the time-to-solution of the data-intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance parallel computing (HPC) resources on which it typically runs with distributed, heterogeneous resources that can be obtained opportunistically from grids and clouds. We seek to minimize time to solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources. We address this by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies, including partitioning of computation (over HPC and distributed resources) and job clustering.
We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final version of this paper.

Copyright 2012 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
DIDC'12, June 19, 2012, Delft, The Netherlands.
Copyright 2012 ACM 978-1-4503-1341-4/12/06 ...$10.00.

Categories and Subject Descriptors

J.2 [Computer Applications]: Earth and atmospheric sciences; Engineering

General Terms

Theory and Implementation

Keywords

Swift, parallel, HPC, SCEC

1. INTRODUCTION

Our work aggregates XSEDE (1) and the Open Science Grid (OSG) to run the CyberShake application [4,12] faster than on XSEDE alone. XSEDE is most often used as a distributed collection of high-performance systems, where small numbers of large parallel jobs are run, usually each on a single system. OSG is a distributed collection of resources that is most often used for large numbers of small, high-throughput, location-independent jobs. While these two infrastructures can both be accessed through common grid interfaces (e.g., GRAM to submit jobs, GridFTP to move data), they are disjoint in terms of access procedures, operations, support, and the types of jobs (e.g., parallel vs. serial) for which their resources are optimized. This results in lost

(1) We use XSEDE here to refer to the TACC Ranger HPC system, wide area network, etc., which were called TeraGrid when this work started and have since transitioned into XSEDE, the "Extreme Science and Engineering Discovery Environment".