Pipelining/Overlapping Data Transfer for Distributed Data-Intensive Job Execution

Eun-Sung Jung, Ketan Maheshwari, Rajkumar Kettimuthu
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439
Email: {esjung, ketan, kettimut}@mcs.anl.gov

Abstract—Scientific workflows are increasingly gaining attention as both data and compute resources become bigger, more heterogeneous, and more distributed. Many scientific workflows are both compute intensive and data intensive and use distributed resources. This situation poses significant challenges for real-time remote analysis and dissemination of massive datasets to scientists across the community, and these challenges will be exacerbated in the exascale era. Parallel jobs are common in scientific workflows, and such parallelism can be exploited by scheduling parallel jobs among multiple execution sites for enhanced performance. Previous scheduling algorithms such as heterogeneous earliest finish time (HEFT) did not focus on scheduling the thousands of jobs often seen in contemporary applications. Some techniques, such as task clustering, have been proposed to reduce the overhead of scheduling a large number of jobs. However, scheduling massively parallel jobs in distributed environments poses new challenges because data movement becomes a nontrivial factor. We propose efficient parallel execution models based on pipelined execution of data transfers, incorporating network bandwidth and reserved resources at an execution site. We formally analyze those models and identify the best model with the optimal degree of parallelism. We implement our model in the Swift parallel scripting paradigm using GridFTP. Experiments on real distributed computing resources show that our model with optimal degrees of parallelism outperforms the current parallel execution model by as much as a 50% reduction in total execution time.

I.
INTRODUCTION

Scientific workflows have gained prominence as tools of choice for running multistaged computations on large, heterogeneous, and distributed resources. Many scientific workflows [1], [2] are both compute intensive and data intensive and use distributed resources. This situation poses significant challenges in terms of real-time remote analysis and dissemination of massive datasets to scientists across the community. These challenges will be exacerbated in the exascale era.

Parallelism is a common occurrence in various scientific workflow patterns [3]. This parallelism can be classified into two broad classes:
1) Workflow parallelism, which occurs because of the presence of independent branches in a workflow.
2) Data parallelism, which occurs because of the need to execute a workflow repeatedly for multiple datasets.

Such parallelism can be exploited by distributing parallel jobs among multiple execution sites for enhanced performance. Previous scheduling algorithms such as heterogeneous earliest finish time (HEFT) [4] did not focus on scheduling parallel jobs that can number in the thousands. Some techniques, such as task clustering [5], have been proposed to reduce the overhead of scheduling a large number of jobs.

Despite these opportunities for parallelism, scheduling massively parallel jobs in distributed environments poses new challenges because data movement becomes a nontrivial factor. Distributing parallel jobs among execution sites according to their computing capacities is not enough for efficient job execution. A job can be naturally divided into three steps: stage-in, execution, and stage-out. The stage-in and stage-out steps refer to the input and output of data at an execution site and involve data transfer mechanisms such as sockets over TCP or specialized tools such as GridFTP [6]. A simple policy of blocking job execution until data for all jobs in an execution step are available can lead to performance degradation.
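The cost of such a blocking policy, compared with overlapping one job's stage-in with another job's execution, can be illustrated with a small simulation. The following Python sketch is purely illustrative (it is not the paper's Swift/GridFTP implementation): the per-step durations are made-up stand-ins for transfer and compute times, and threads stand in for concurrent data channels and compute slots.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-job step durations in seconds (illustrative only;
# real values depend on network bandwidth and site compute capacity).
STAGE_IN, EXECUTE, STAGE_OUT = 0.02, 0.03, 0.01


def blocking_schedule(n_jobs):
    """Stage in all inputs, then run all jobs, then stage out: no overlap."""
    start = time.perf_counter()
    for _ in range(n_jobs):
        time.sleep(STAGE_IN)    # stand-in for a data transfer
    for _ in range(n_jobs):
        time.sleep(EXECUTE)     # stand-in for computation
    for _ in range(n_jobs):
        time.sleep(STAGE_OUT)
    return time.perf_counter() - start


def pipelined_schedule(n_jobs):
    """Overlap job i+1's stage-in with job i's execution (depth-2 pipeline)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Prime the pipeline with the first job's stage-in.
        pending = pool.submit(time.sleep, STAGE_IN)
        for i in range(n_jobs):
            pending.result()    # wait until job i's input is staged in
            if i + 1 < n_jobs:
                # Prefetch the next job's input while job i executes.
                pending = pool.submit(time.sleep, STAGE_IN)
            time.sleep(EXECUTE)
            time.sleep(STAGE_OUT)
    return time.perf_counter() - start
```

With these numbers, the blocking schedule takes roughly n_jobs × (STAGE_IN + EXECUTE + STAGE_OUT), whereas the pipelined schedule hides all but the first stage-in behind execution, taking roughly STAGE_IN + n_jobs × (EXECUTE + STAGE_OUT). Section III develops this analysis formally for the models considered in the paper.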
In this paper, we propose efficient parallel execution models through pipelined execution of data transfers over dedicated channels overlapped with execution, incorporating better utilization of network bandwidth and reserved resources at an execution site. We formally analyze these models and suggest the best model with an improved exploitation of parallelism. Consequently, the paper makes the following contributions:
1) A review and theoretical analysis of various pipelined execution models for wide-area distributed computing; such models were originally designed for tightly coupled architectures such as vector and superscalar machines.
2) Decomposition of the traditional staging and execution cycles of application tasks into distinct jobs to exploit parallelism by overlapping those cycles in a pipeline.
3) Implementation and evaluation of the above method via the Swift parallel scripting framework [7] on wide-area distributed resources enabled with third-party GridFTP data transfers.

The remainder of the paper is structured as follows. In Section II, we briefly present background on distributed parallel computing. In Section III, we propose several parallel job execution models and present mathematical analysis of those models and corresponding optimal degrees of parallelism. In Section IV, we evaluate our analytical models with actual implementation in the Swift parallel scripting paradigm with