Managing Rapidly-Evolving Scientific Workflows

Juliana Freire, Cláudio T. Silva, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, and Huy T. Vo

University of Utah

Abstract. We give an overview of VisTrails, a system that provides an infrastructure for systematically capturing detailed provenance and streamlining the data exploration process. A key feature that sets VisTrails apart from previous visualization and scientific workflow systems is a novel action-based mechanism that uniformly captures provenance for data products and the workflows used to generate them. This mechanism not only ensures reproducibility of results, but also simplifies data exploration by allowing scientists to easily navigate the space of workflows and parameter settings for an exploration task.

1 Introduction

Workflow systems have traditionally been used to automate repetitive tasks and to ensure reproducibility of results [1, 6, 9, 10]. However, for applications that are exploratory in nature, in which large parameter spaces must be investigated, a large number of related workflows must be created. Data exploration and visualization, for example, require scientists to assemble complex workflows consisting of dataset selection and the specification of a series of algorithms and visualization techniques to transform, analyze, and visualize the data. The workflow specification is then adjusted in an iterative process, as the scientist generates, explores, and evaluates hypotheses about the data under study. Often, insight comes from comparing multiple data products: for example, by applying a given visualization process to multiple datasets, by varying the values of simulation parameters, or by applying different variations of a process (e.g., ones that use different visualization algorithms) to a dataset.
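The action-based provenance idea mentioned above can be sketched in a few lines: each workflow version is a node in a tree, each edge records the change action that produced it, and any version can be reconstructed by replaying actions from the root. The sketch below is a minimal illustration under these assumptions; the class and method names are hypothetical and do not correspond to VisTrails' actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single change to a workflow, e.g. adding a module or setting a parameter."""
    kind: str      # "add_module" or "set_param" (illustrative action kinds)
    payload: dict

class VersionTree:
    """Stores workflow evolution as a tree of change actions (version 0 is the empty root)."""
    def __init__(self):
        self.parent = {0: None}   # version id -> parent version id
        self.action = {}          # version id -> Action that created it
        self._next = 1

    def apply(self, parent_version, action):
        """Record an action as a new version; earlier versions are never overwritten."""
        v = self._next
        self._next += 1
        self.parent[v] = parent_version
        self.action[v] = action
        return v

    def replay(self, version):
        """Reconstruct a workflow specification by replaying actions from the root."""
        chain = []
        while version != 0:
            chain.append(self.action[version])
            version = self.parent[version]
        workflow = {"modules": [], "params": {}}
        for act in reversed(chain):
            if act.kind == "add_module":
                workflow["modules"].append(act.payload["name"])
            elif act.kind == "set_param":
                workflow["params"][act.payload["name"]] = act.payload["value"]
        return workflow

# Exploring a parameter space: two sibling versions vary one parameter,
# and both remain reproducible from their recorded actions.
tree = VersionTree()
v1 = tree.apply(0, Action("add_module", {"name": "ContourFilter"}))
v2 = tree.apply(v1, Action("set_param", {"name": "isovalue", "value": 0.5}))
v3 = tree.apply(v1, Action("set_param", {"name": "isovalue", "value": 0.7}))
```

Because versions are derived by appending actions rather than editing a single specification in place, no variant is ever lost, which is what makes navigating the space of related workflows tractable.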
This places the burden on the scientist to first generate a data product and then to remember the input datasets, parameter values, and the exact workflow configuration that led to it. As a result, much time is spent manually managing these rapidly-evolving workflows, their relationships, and the associated data.

Consider the problem of radiation treatment planning. Whereas a scanner can create a new dataset in minutes, it can take from several hours to days to create appropriate visualizations, even using advanced dataflow-based visualization tools such as SCIRun [10]. Fig. 1 shows a series of visualizations generated from a CT scan of a torso; each visualization is created by a different dataflow. During the exploratory process, a visualization expert needs to manually record information about how the dataflows evolve. Often, this is achieved through a combination of written notes and file-naming conventions. For planning the treatment of a single patient, it is not uncommon that a few hundred files are created to store