Integrating Existing Scientific Workflow Systems: The Kepler/Pegasus Example

Nandita Mandal, Ewa Deelman, Gaurang Mehta, Mei-Hui Su, Karan Vahi
USC Information Sciences Institute
Marina Del Rey, CA 90292
{nandita, deelman, gmehta, mei, vahi}@isi.edu

ABSTRACT
Scientific workflows have become an important tool used by scientists to conduct large-scale analysis in distributed environments. Today there is a variety of workflow systems that provide often disjoint sets of capabilities and expose different workflow modeling semantics to users. In this paper we examine the possibility of integrating two well-known workflow systems, Kepler and Pegasus, and examine the opportunities and challenges presented by such an integration. We illustrate the combined system on a workflow used as the basis of a provenance challenge.

Categories and Subject Descriptors
D.1 Programming Techniques.

General Terms
Design, Languages

Keywords
Scientific Workflows, Programming Models, User Interfaces

1. INTRODUCTION
Scientific workflows are quickly becoming recognized as an important unifying mechanism to combine scientific data management, analysis, simulation, and visualization tasks [1]. Scientific workflows often exhibit particular traits, e.g., they can be data-intensive, compute-intensive, or visualization-intensive, thus covering a wide range of applications from low-level “plumbing workflows” of interest to grid engineers, to high-level “knowledge discovery workflows” for scientists [2].

There are many workflow management systems today, each with its own strengths and weaknesses. When designing workflows, scientists need to choose their target workflow management system and in the process often need to trade off the various capabilities. In this paper we examine the possibility of integrating two well-known workflow management systems, Kepler [2] and Pegasus [3], in the hope of leveraging their respective strengths.
We describe our initial integration and show the results of our approach using an example workflow that formed the basis of the provenance challenge [4], which aimed to compare and contrast provenance models developed within a variety of systems, most of which were workflow-based.

2. KEPLER
The Kepler scientific workflow system [2] provides domain scientists with an easy-to-use system for capturing scientific workflows. Kepler attempts to streamline the workflow creation and execution process so that scientists can design, execute, monitor, re-run, and communicate analytical procedures repeatedly with minimal effort [5]. The system follows an actor-oriented modeling approach where individual workflow components (e.g., for data movement, database querying, job scheduling, remote execution, etc.) are abstracted into a set of generic, reusable tasks. Instantiations of these common tasks can be functionally equivalent atomic components (called actors) or composite components (so-called composite actors or sub-workflows) [6]. Figure 1 shows a snapshot of Kepler running a gene sequence workflow utilizing web services and data transformations. Kepler’s intuitive GUI for design and execution (inherited from Ptolemy [7]) and its actor-oriented modeling paradigm make it a very versatile tool for workflow design, prototyping, execution, and reuse for both workflow engineers and end users. Kepler workflows can be exchanged in XML using Ptolemy’s own Modeling Markup Language (MoML) [9].

3. PEGASUS
The Pegasus mapping and planning framework uses the concept of abstract workflows to describe and model abstract job computations in distributed environments, such as the grid. The framework creates a separation between the application description and the actual execution.
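The separation between the application description and the actual execution can be illustrated with a minimal Python sketch. This is not Pegasus code and does not use its API: the job names, transformation names, site names, `site_catalog`, and the `plan` function are all hypothetical, chosen only to show an abstract workflow that names no resources being turned into a concrete plan by a separate mapping step.

```python
from graphlib import TopologicalSorter

# Abstract workflow (hypothetical): job -> (logical transformation, parent jobs).
# No hosts, paths, or schedulers appear in this description.
abstract_workflow = {
    "preprocess": ("clean_data", set()),
    "analyze": ("run_model", {"preprocess"}),
    "summarize": ("make_report", {"analyze"}),
}

# Hypothetical catalog of which site offers which transformation.
site_catalog = {
    "cluster_a": {"clean_data", "run_model"},
    "cluster_b": {"run_model", "make_report"},
}


def plan(workflow, sites):
    """Return (job, site) pairs in an order that respects job dependencies."""
    graph = {job: parents for job, (_, parents) in workflow.items()}
    schedule = []
    for job in TopologicalSorter(graph).static_order():
        transformation = workflow[job][0]
        # Pick the alphabetically first site that offers the transformation.
        candidates = sorted(
            site for site, offered in sites.items() if transformation in offered
        )
        if not candidates:
            raise ValueError(f"no site offers {transformation!r}")
        schedule.append((job, candidates[0]))
    return schedule


concrete_plan = plan(abstract_workflow, site_catalog)
for job, site in concrete_plan:
    print(f"{job} -> {site}")
```

In this sketch the same `abstract_workflow` could be planned against a different catalog without being edited, which is the kind of independence the abstract workflow concept is meant to provide. The site selection rule here is deliberately naive and stands in for whatever policy a real planner applies.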
Users describe workflows in resource-independent ways, and Pegasus maps them onto potentially multiple heterogeneous resources distributed across wide area networks, while shielding the user from grid details [3]. Pegasus finds appropriate resources to execute the computations and modifies the user-specified workflow to execute on those resources. Pegasus also adds tasks for data