CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2006; 18:609–620
Published online 8 November 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.969

Building reliable and efficient data transfer and processing pipelines

T. Kosar∗,†, G. Kola and M. Livny

Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, WI 53706, U.S.A.

∗ Correspondence to: T. Kosar, Computer Sciences Department, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, WI 53706, U.S.A.
† E-mail: kosart@cs.wisc.edu

Received 10 December 2004; Revised 22 February 2005; Accepted 1 March 2005

SUMMARY

Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Our framework provides a universal interface to different data transfer protocols and storage systems. It has sophisticated flow control and recovers automatically from network, storage system, software and hardware failures. We successfully used data pipelines to replicate and process three terabytes of the DPOSS astronomy image dataset and several terabytes of the WCER educational video dataset. In both cases, the entire process was performed without any human intervention, and the data pipeline recovered automatically from various failures. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: workflows; data pipelines; data transfer; data-intensive computing; distributed systems; Grid computing; fault tolerance

1. INTRODUCTION

Grid computing [1] has enabled researchers to collaborate more effectively by sharing computing resources. Many fields, including astronomy, genetics, biomedicine and geology, have to transfer large amounts of data for processing and sharing among their collaborating organizations. Each organization has either developed or adopted a particular storage system. For example, NCSA uses UniTree [2], SDSC uses SRB [3], LBNL uses HPSS [4] and Fermi uses Enstore [5] mass storage systems. Collaborating researchers spanning these organizations need to transfer data among the different storage systems and often have to speak multiple protocols. While reliably transferring large amounts of data over the wide area is in itself difficult, the additional complexity