A Modeling and Execution Environment for Distributed Scientific Workflows *

Ilkay Altintas \  Sangeeta Bhagwanani +  David Buttler *  Sandeep Chandra +  Zhengang Cheng +  Matthew A. Coleman ‡  Terence Critchlow ‡  Amarnath Gupta \  Wei Han *  Ling Liu *  Bertram Ludäscher \  Calton Pu *  Reagan Moore \  Arie Shoshani †  Mladen Vouk +

1 Introduction

The Scientific Data Management Center project (short: SDM) is part of a large research program sponsored by the US Department of Energy (DOE) to enable Scientific Discovery through Advanced Computing [SDM02, Sci]. SDM brings together research teams from DOE labs and universities to address and resolve novel data management challenges that arise from the new data- and information-centric ways in which science is conducted today.

This demonstration illustrates how a domain scientist can perform a complex scientific task by interleaving data access, querying, and manipulation, as well as analytical steps and computations, in complex, problem-specific ways. We show how our system is used by a geneticist to solve the problem of discovering so-called “co-regulated” genes by interlinking data and computation from several web sites, local computations, and local and remote databases. The main distinctive features of our system (compared, e.g., to the ZOO environment [ILGP96]) are: (i) executable workflows run as web services; (ii) abstract workflows employ concept names and semantic types that are higher-level (and thus more “scientist friendly”) than those of executable workflows; and (iii) our system supports automatic translation of abstract workflows into executable ones.

A Scientist’s Problem: Promoter Identification Workflow (PIW). Through the Human Genome Sequencing Project a wealth of information has been gained at the nucleotide level. With the advent of DNA-based microarrays the wealth of data for interpretation is quickly becoming daunting.
A starting point for discovery is to link genomic biology approaches such as microarrays with bioinformatics to identify and characterize eukaryotic promoters; here, we call this the promoter identification workflow or PIW.¹ To clearly identify co-regulated groups of genes, high-throughput computational molecular biology tools are first needed that scale to a variety of tasks such as identifying DNA sequences of interest, comparing DNA sequences, and identifying transcription factor binding sites. Some of these steps can be executed by querying web-accessible databases and computation resources.

* Georgia Institute of Technology, † Lawrence Berkeley Laboratory (LBL), ‡ Lawrence Livermore National Laboratory (LLNL), + North Carolina State University (NCSU), \ San Diego Supercomputer Center (SDSC). This work was supported by DOE LLNL contract No. W-7405-Eng-48, SciDAC/SDM contract No. DE-FC02-01ER25486, and NSF grant No. ITR 0225676 (SEEK).

However, using web sources “as-is” to enact scientific workflows requires many manual and thus time-consuming and error-prone steps. It is desirable to automate scientific workflows such as the PIW as much as possible. To do so, a number of information technology and database challenges have to be overcome:

• Most current web sources are made for human interaction and thus do not lend themselves easily to automation. Semiautomatic or automatic wrapping techniques have to be applied in order to turn interactive web sources into remote function invocations and database queries.

• An execution environment for running distributed workflows over the web has to be devised. This includes capabilities for monitoring workflow execution, checkpointing, and re-running or resuming suspended runs. This is hard due to the autonomous nature of sources, their heterogeneous and limited access capabilities, and their occasional, unpredictable downtimes.
• The design of scientific workflows poses unique challenges both to the domain scientist who drives the overall design and to the IT expert who is charged with defining the specific data and control flow. This is due to the complexity of the scientific data, the complexity of the (often hidden) semantic links between the different data sources, and the complexity of the syntactic and procedural intricacies that have to be overcome when chaining together actual web sources in the PIW.

¹ A promoter is a subsequence of a chromosome that sits close to a gene and regulates its activity.
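
To make the checkpointing and resumption requirement from the list of challenges more concrete, the following minimal Python sketch runs a linear workflow, saves each step's result to a checkpoint file, and skips already completed steps when the run is started again. It is an illustration only and not the SDM execution environment: the function `run_workflow`, the JSON checkpoint format, and the stand-in steps `find_sequences` and `compare_sequences` with their made-up data are all assumptions introduced here.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List, Tuple

# A step is (name, function); the function receives the previous step's output.
Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], initial: Any, checkpoint_path: str) -> Any:
    """Run steps in order, checkpointing each result so a later call can resume."""
    done: Dict[str, Any] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # results of steps completed in an earlier run

    value = initial
    for name, func in steps:
        if name in done:
            value = done[name]  # skip the step and reuse its saved result
            continue
        value = func(value)  # may raise, e.g. when a remote source is down
        done[name] = value
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)  # atomic swap of the checkpoint file
    return value


if __name__ == "__main__":
    source_up = {"value": False}  # simulates an occasionally unavailable source

    def find_sequences(_: Any) -> List[str]:
        return ["ACGT", "ACGA"]  # hypothetical, made-up data

    def compare_sequences(seqs: List[str]) -> List[str]:
        if not source_up["value"]:
            raise ConnectionError("remote source unavailable")
        return sorted(set(seqs))

    steps: List[Step] = [("find", find_sequences), ("compare", compare_sequences)]
    path = os.path.join(tempfile.mkdtemp(), "piw_checkpoint.json")

    try:
        run_workflow(steps, None, path)  # suspended at the second step
    except ConnectionError as err:
        print("run suspended:", err)

    source_up["value"] = True
    print("resumed result:", run_workflow(steps, None, path))
```

In this sketch the second call does not repeat the first step, because its result is read back from the checkpoint file. A realistic environment would additionally need monitoring and would have to cope with the heterogeneous, limited access capabilities of autonomous sources, which the sketch ignores.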