A High-Level Distributed Execution Framework for Scientific Workflows

Jianwu Wang 1, Ilkay Altintas 1, Chad Berkley 2, Lucas Gilbert 1, Matthew B. Jones 2
1 San Diego Supercomputer Center, UCSD, U.S.A. {jianwu, altintas, iktome}@sdsc.edu
2 National Center for Ecological Analysis and Synthesis, UCSB, U.S.A. {berkley, jones}@nceas.ucsb.edu

Abstract

Domain scientists synthesize different data and computing resources to solve their scientific problems. Distributed execution within scientific workflows is a growing and promising way to achieve better execution performance and efficiency. This paper presents a high-level distributed execution framework designed around the distributed execution requirements identified within the Kepler community. It also discusses mechanisms that make the presented framework easy to use, comprehensive, adaptable, extensible and efficient.

1. Introduction

Scientific workflow management systems, e.g., Taverna [1], Triana [2], Pegasus [3], Kepler [4], ASKALON [7] and SWIFT [12], have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. Scientific workflows can operate at different levels of granularity, from low-level workflows that explicitly move data around and start and monitor remote jobs, to high-level "conceptual workflows" that interlink complex, domain-specific data analysis steps. Grid workflows with distributed execution can be seen as one type of scientific workflow. Most workflow systems centralize execution [5], which often causes a performance bottleneck. We summarize requirements gathered within the Kepler community and propose a distributed execution framework that takes advantage of abundant distributed computing resources to achieve better execution performance and efficiency.
Based on community feedback, our goals for the Kepler distributed execution framework include the ability to easily form ad-hoc networks of cooperating Kepler instances. Each cooperating Kepler network can impose access constraints and allows Kepler models or sub-models to be run on its participating instances. Once a cooperating network has been created, users can configure one or more subcomponents of a workflow to be distributed across the nodes of the newly constructed network. The major contribution of this paper is a distributed scientific workflow approach that combines an intuitive user interface, collaborative features, and capabilities for distributing both workflow tasks and the workflows themselves in a single framework. In Section 2, we discuss the background of distributed execution for scientific workflows. Sections 3 and 4 describe the conceptual architecture and the framework. Section 5 presents a case study showing how the framework works. Finally, we conclude and outline future work in Section 6.

2. Background

Our work builds on the following aspects: the structure of scientific workflow specifications, typical distributed execution requirements as specified by scientists, and prior work in distributed execution.

2.1. Scientific Workflow Specification Structure

There are several different formats for representing scientific workflows [14, 15, 16], but they are generally graph descriptions that represent three types of components: tasks, data dependencies, and control dependencies [5]. For example, in Figure 1 the tasks T2 and T3 will be executed under different conditions, and T4 needs to get data from either T2 or T3 before its execution. Since our framework incorporates

Fourth IEEE International Conference on eScience, 978-0-7695-3535-7/08 $25.00 © 2008 IEEE, DOI 10.1109/eScience.2008.166
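To make the graph structure concrete, the following is a minimal illustrative sketch (in Python, not Kepler's actual specification format) of a workflow graph with tasks, data dependencies, and control dependencies, mirroring the Figure 1 example: T2 and T3 run under different conditions, and T4 can proceed once it has data from either of them. All names and the `ready` scheduling rule here are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node of a workflow graph (illustrative, not Kepler's format)."""
    name: str
    data_inputs: list = field(default_factory=list)    # data dependencies
    control_after: list = field(default_factory=list)  # control dependencies
    condition: str = ""  # guard deciding whether the task runs (hypothetical)

# The Figure 1 example: T2 and T3 are conditional alternatives after T1;
# T4 consumes data from whichever of T2 or T3 actually executed.
workflow = {
    "T1": Task("T1"),
    "T2": Task("T2", control_after=["T1"], condition="x > 0"),
    "T3": Task("T3", control_after=["T1"], condition="x <= 0"),
    "T4": Task("T4", data_inputs=["T2", "T3"]),
}

def ready(task, finished):
    """A task is ready when its control predecessors have finished and,
    if it has data inputs, at least one of them has produced data
    (either-or semantics for T4)."""
    ctrl_ok = all(t in finished for t in task.control_after)
    data_ok = not task.data_inputs or any(t in finished for t in task.data_inputs)
    return ctrl_ok and data_ok
```

Under this sketch, `ready(workflow["T4"], {"T1", "T2"})` holds even though T3 never ran, capturing the either/or data dependency described above; a real specification format would of course encode conditions and channels far more richly.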