Managing Rapidly-Evolving Scientific Workflows

Juliana Freire, Cláudio T. Silva, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, and Huy T. Vo

University of Utah

Abstract. We give an overview of VisTrails, a system that provides an infrastructure for systematically capturing detailed provenance and streamlining the data exploration process. A key feature that sets VisTrails apart from previous visualization and scientific workflow systems is a novel action-based mechanism that uniformly captures provenance for data products and the workflows used to generate them. This mechanism not only ensures reproducibility of results, but also simplifies data exploration by allowing scientists to easily navigate the space of workflows and parameter settings for an exploration task.

1 Introduction

Workflow systems have traditionally been used to automate repetitive tasks and to ensure reproducibility of results [1, 6, 9, 10]. However, for applications that are exploratory in nature, in which large parameter spaces must be investigated, a large number of related workflows must be created. Data exploration and visualization, for example, require scientists to assemble complex workflows consisting of dataset selection and the specification of a series of algorithms and visualization techniques to transform, analyze, and visualize the data. The workflow specification is then adjusted in an iterative process, as the scientist generates, explores, and evaluates hypotheses about the data under study. Often, insight comes from comparing multiple data products: for example, by applying a given visualization process to multiple datasets, by varying the values of simulation parameters, or by applying different variations of a process (e.g., ones that use different visualization algorithms) to a dataset.
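The action-based provenance idea mentioned above can be sketched in a few lines: each workflow version is a node in a tree, each edge records the change action that produced it, and any version can be reconstructed by replaying actions from the root. The sketch below is a minimal illustration under these assumptions; the class and method names are hypothetical and do not correspond to VisTrails' actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single change to a workflow, e.g. adding a module or setting a parameter."""
    kind: str      # "add_module" or "set_param" (illustrative action kinds)
    payload: dict

class VersionTree:
    """Stores workflow evolution as a tree of change actions (version 0 is the empty root)."""
    def __init__(self):
        self.parent = {0: None}   # version id -> parent version id
        self.action = {}          # version id -> Action that created it
        self._next = 1

    def apply(self, parent_version, action):
        """Record an action as a new version; earlier versions are never overwritten."""
        v = self._next
        self._next += 1
        self.parent[v] = parent_version
        self.action[v] = action
        return v

    def replay(self, version):
        """Reconstruct a workflow specification by replaying actions from the root."""
        chain = []
        while version != 0:
            chain.append(self.action[version])
            version = self.parent[version]
        workflow = {"modules": [], "params": {}}
        for act in reversed(chain):
            if act.kind == "add_module":
                workflow["modules"].append(act.payload["name"])
            elif act.kind == "set_param":
                workflow["params"][act.payload["name"]] = act.payload["value"]
        return workflow

# Exploring a parameter space: two sibling versions vary one parameter,
# and both remain reproducible from their recorded actions.
tree = VersionTree()
v1 = tree.apply(0, Action("add_module", {"name": "ContourFilter"}))
v2 = tree.apply(v1, Action("set_param", {"name": "isovalue", "value": 0.5}))
v3 = tree.apply(v1, Action("set_param", {"name": "isovalue", "value": 0.7}))
```

Because versions are derived by appending actions rather than editing a single specification in place, no variant is ever lost, which is what makes navigating the space of related workflows tractable.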
This places the burden on the scientist to first generate a data product and then to remember the input datasets, parameter values, and the exact workflow configuration that led to it. As a result, much time is spent manually managing these rapidly-evolving workflows, their relationships, and the associated data.

Consider the problem of radiation treatment planning. Whereas a scanner can create a new dataset in minutes, it can take from several hours to days to create appropriate visualizations, even using advanced dataflow-based visualization tools such as SCIRun [10]. Fig. 1 shows a series of visualizations generated from a CT scan of a torso; each visualization is created by a different dataflow. During the exploratory process, a visualization expert needs to manually record information about how the dataflows evolve. Often, this is achieved through a combination of written notes and file-naming conventions. For planning the treatment of a single patient, it is not uncommon that a few hundred files are created to store