A Modeling and Execution Environment for Distributed Scientific Workflows *

Ilkay Altintas \, Sangeeta Bhagwanani +, David Buttler *, Sandeep Chandra +, Zhengang Cheng +, Matthew A. Coleman ‡, Terence Critchlow ‡, Amarnath Gupta \, Wei Han *, Ling Liu *, Bertram Ludäscher \, Calton Pu *, Reagan Moore \, Arie Shoshani †, Mladen Vouk +

1 Introduction

The Scientific Data Management Center project (short: SDM) is part of a large research program sponsored by the US Department of Energy (DOE) to enable Scientific Discovery through Advanced Computing [SDM02, Sci]. SDM brings together research teams from DOE labs and universities to address and resolve novel data management challenges that arise due to the new data- and information-centric ways in which science is conducted today.

This demonstration illustrates how a domain scientist can perform a complex scientific task by interleaving data access, querying, and manipulation, as well as analytical steps and computations in complex, problem-specific ways. We show how our system is used by a geneticist for solving the problem of discovering so-called “co-regulated” genes by interlinking data and computation from several web sites, local computations, as well as local and remote databases. The main distinctive features of our system (compared, e.g., to the ZOO environment [ILGP96]) include (i) executable workflows run as web services, (ii) abstract workflows employ concept names and semantic types that are higher-level (and thus more “scientist friendly”) than executable workflows, and (iii) our system supports automatic translation of the latter into the former.

A Scientist’s Problem: Promoter Identification Workflow (PIW). Through the Human Genome Sequencing Project a wealth of information has been gained at the nucleotide level. With the advent of DNA-based microarrays the wealth of data for interpretation is quickly becoming daunting.
A starting point for discovery is to link genomic biology approaches such as microarrays with bioinformatics to identify and characterize eukaryotic promoters; here, we call this the promoter identification workflow, or PIW.¹

* Georgia Institute of Technology, † Lawrence Berkeley Laboratory (LBL), ‡ Lawrence Livermore National Laboratory (LLNL), + North Carolina State University (NCSU), \ San Diego Supercomputer Center (SDSC). This work was supported by DOE LLNL contract No. W-7405-Eng-48, SciDAC/SDM contract No. DE-FC02-01ER25486, and NSF grant No. ITR 0225676 (SEEK).

To clearly identify co-regulated groups of genes, scalable, high-throughput computational molecular biology tools are first needed for tasks such as identifying DNA sequences of interest, comparing DNA sequences, and identifying transcription factor binding sites. Some of these steps can be executed by querying web-accessible databases and computation resources. However, using web sources “as-is” to enact scientific workflows requires many manual, and thus time-consuming and error-prone, steps. It is desirable to automate scientific workflows such as the PIW as much as possible. To do so, a number of information technology and database challenges have to be overcome:

• Most current web sources are made for human interaction and thus do not lend themselves easily to automation. Semiautomatic or automatic wrapping techniques have to be applied in order to turn interactive web sources into remote function invocations and database queries.

• An execution environment for running distributed workflows over the web has to be devised. This includes capabilities for monitoring workflow execution, checkpointing, and re-running or resuming suspended runs. This is hard due to the autonomous nature of sources, their heterogeneous and limited access capabilities, and their occasional, unpredictable downtimes.
• The design of scientific workflows poses unique challenges both to the domain scientist who drives the overall design and to the IT expert who is charged with defining the specific data and control flow. This is due to the complexity of the scientific data, the complexity of the (often hidden) semantic links between the different data sources, and the complexity of the syntactic and procedural intricacies that have to be overcome when chaining together actual web sources in the PIW.

¹ A promoter is a subsequence of a chromosome that sits close to a gene and regulates its activity.
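
To make the execution-environment challenge listed above more concrete, the following is a minimal sketch of checkpointing and resuming a suspended run. It is an illustration only, not the SDM system's implementation: the names `run_workflow` and `SourceUnavailable`, the JSON checkpoint file, and the two toy steps standing in for PIW stages are all assumptions introduced here.

```python
import json
import os
import tempfile


class SourceUnavailable(Exception):
    """Hypothetical error: a step's remote source is temporarily down."""


def run_workflow(steps, checkpoint_path, initial_input):
    """Run steps in order, saving each result so a suspended run can resume.

    steps: list of (name, function) pairs; each function maps the previous
    step's output to a JSON-serializable result.
    Returns ("completed", final_result) or ("suspended", failed_step_name).
    """
    # Load the results of steps that finished in an earlier run, if any.
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    data = initial_input
    for name, func in steps:
        if name in done:
            data = done[name]  # reuse the checkpointed result, skip the call
            continue
        try:
            data = func(data)
        except SourceUnavailable:
            return "suspended", name  # earlier results are already on disk
        done[name] = data
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # checkpoint after every completed step
    return "completed", data


# Toy steps (assumed for illustration) that record how often they are called.
calls = []
source_up = {"value": False}


def find_sequences(gene_ids):
    calls.append("find_sequences")
    return [g + ":ACGT" for g in gene_ids]


def compare_sequences(seqs):
    calls.append("compare_sequences")
    if not source_up["value"]:
        raise SourceUnavailable("comparison service is down")
    return sorted(seqs)


steps = [("find_sequences", find_sequences),
         ("compare_sequences", compare_sequences)]
path = os.path.join(tempfile.mkdtemp(), "piw_checkpoint.json")

first = run_workflow(steps, path, ["g2", "g1"])   # second source is down
source_up["value"] = True                         # source comes back
second = run_workflow(steps, path, ["g2", "g1"])  # resumes at failed step
```

In this sketch the first run suspends at the unavailable source, and the second run skips the step that already completed and resumes from the failed one. A real environment would additionally have to handle the monitoring, heterogeneous access capabilities, and autonomy of sources that the bullet describes, none of which this sketch attempts.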