End-to-End Data Solutions for Distributed Petascale Science

Jennifer M. Schopf 1,2, Ann Chervenak 3, Ian Foster 1,2,4, Dan Fraser 1,2, Dan Gunter 5, Nick LeRoy 6, Brian Tierney 5

1 Computation Institute, University of Chicago and Argonne National Laboratory
2 Mathematics and Computer Science Division, Argonne National Laboratory
3 Information Sciences Institute, University of Southern California
4 Department of Computer Science, University of Chicago
5 Lawrence Berkeley National Laboratory
6 Department of Computer Science, University of Wisconsin

1. Petascale Science is an End-to-End Problem

Petascale science is an end-to-end endeavor, involving not only the creation of massive datasets at supercomputers or experimental facilities, but also the subsequent analysis of that data by a user community that may be distributed across many laboratories and universities. The new Center for Enabling Distributed Petascale Science (CEDPS), supported by the US Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program, is developing tools to support this end-to-end process. In this brief article, we summarize the goals of the project and its progress to date. Some material is adapted from a longer article that appeared in the 2007 SciDAC conference proceedings [7].

At a recent workshop on computational science, the chair noted in his introductory remarks that if the speed of airplanes had increased by the same factor as that of computers over the last 50 years, namely five orders of magnitude, then we would be able to cross the US in less than a second. This analogy communicates with great effectiveness the remarkable impact of continued exponential growth in computational performance, which, along with comparable improvements in solution methods, is arguably the foundation of SciDAC. However, a participant was heard to exclaim following these remarks: "Yes, but it would still take two hours to get downtown!" The serious point behind this remark is that science is an end-to-end problem, and that accelerating any single aspect of the problem-solving process can achieve only limited returns in overall scientific productivity.

These concerns become particularly important as we enter the era of petascale science, by which we mean science involving numerical simulations performed on supercomputers capable of a petaflop/sec or higher performance, and/or experimental apparatus, such as the Large Hadron Collider [4], light sources and other user facilities [1], and ITER [3], capable of producing petabytes of data. Successful science using such devices demands not only that we be able to construct and operate the simulation or experiment, but also that a distributed community of participants be able to access, analyze, and ultimately make sense of the resulting massive datasets. In the absence of appropriate solutions to this end-to-end problem, the utility of these unique facilities can be severely compromised.

The following example illustrates issues that can arise in such contexts. A team at the University of Chicago recently used the FLASH3 code to perform the world's largest compressible homogeneous isotropic turbulence simulation [15]. Using 11 million CPU-hours on the LLNL BG/L computer over a period of a week, they produced a total of 154 terabytes of data contained in 75 million files, which were subsequently archived.
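To give a concrete sense of the scale involved, the following back-of-envelope sketch uses only the figures quoted above; the 1 Gb/s wide-area link in the last step is an illustrative assumption rather than a measured value.

    # Back-of-envelope figures for the FLASH3 turbulence dataset described above.
    # Inputs are the values quoted in the text; the wide-area bandwidth is an
    # illustrative assumption, not a measurement.

    TB = 10**12          # bytes per terabyte (decimal convention)
    MB = 10**6           # bytes per megabyte

    total_bytes = 154 * TB       # total archived output
    num_files = 75 * 10**6       # number of files produced
    run_seconds = 7 * 24 * 3600  # roughly one week of wall-clock time on BG/L

    avg_file_size = total_bytes / num_files       # about 2 MB per file
    sustained_output = total_bytes / run_seconds  # about 255 MB/s averaged over the run

    # Hypothetical: time to move the full dataset over a dedicated 1 Gb/s WAN link
    wan_bytes_per_sec = 1e9 / 8
    wan_transfer_days = total_bytes / wan_bytes_per_sec / 86400  # about 14 days

    print(f"average file size: {avg_file_size / MB:.1f} MB")
    print(f"sustained output rate: {sustained_output / MB:.0f} MB/s")
    print(f"full-dataset transfer at 1 Gb/s: {wan_transfer_days:.0f} days")

Even under this optimistic assumption, moving the full dataset would take on the order of weeks, which underlines why end-to-end data handling, and not just the simulation itself, dominates the time to scientific insight.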
Subsequently, they used GridFTP to move 23 terabytes of this data to computers at the University of Chicago; using four parallel streams, this transfer took some three weeks at around 20 megabytes/sec. Next, they spent considerable time using local resources to tag, analyze, and visualize the data, augmenting the metadata as they went. In a final step, they are making this unique dataset available to the community of turbulence researchers by providing analysis services through which other researchers can securely download portions of the data for their own use. In each