Knowledge Provenance in Virtual Observatories: Application to Image Data Pipelines

Peter Fox
HAO/ESSL/NCAR
PO Box 3000, Boulder, CO 80307
(1) 303-497-1511
pfox@ucar.edu

Deborah McGuinness
Tetherless World Constellation, Rensselaer Polytechnic Institute
110 8th St, Troy, NY 12180
(1) 518-276-4404
dlm@cs.rpi.edu

Paulo Pinheiro da Silva
Department of Computer Science, University of Texas El Paso
El Paso, Texas 79968-0518
(1) 915-747-6827
paulo@utep.edu

ABSTRACT
Scientific data services are increasing in usage and scope, and with these increases comes a growing need for access to provenance information. Our goal is to design and implement an extensible provenance solution that is deployed at science data ingest time. In this paper, we describe our work in the setting of a particular set of data services in the area of solar coronal physics. The paper focuses on one existing federated data service and one proposed observatory. We claim both that the design and implementation are useful for the particular scientific image data services we designed for and that the design provides an operational specification for other scientific data applications. We highlight the need for and usage of semantic technologies and tools in our design and implemented service.

Categories and Subject Descriptors
H.2.5 [Heterogeneous Databases]: Data translation; I.2.4 [Knowledge Representation Formalisms and Methods]: Relation systems, Representation languages, Representations (procedural and rule-based), Semantic networks.

Keywords
Provenance; Image processing; semantics: markup, explanation, justification.

1. INTRODUCTION
Our goal is to create a next-generation virtual observatory that includes an extensible representation of provenance for data ingest systems. Further, we consider provenance to be a first-class item: our system will support semantically enabled queries over the provenance as well as the use of provenance to filter data requests.
To test our design, we are implementing our work in the domain of solar coronal physics. Our initial target is the Advanced Coronal Observing System (ACOS), currently operated at the Mauna Loa Solar Observatory (MLSO). The design is also expected to be implemented in the proposed CoSMO (Coronal Solar Magnetism Observatory).

We illustrate our setting in Figure 1, which is an abstracted representation of a typical data ingest pipeline for solar physics data streams. From the perspective of a provenance system, some aspects of the service are worth highlighting. First, data (represented in square boxes in the figure) passes through a number of stages and is potentially subject to a significant number of possibly complex processing steps. Second, each collection and processing pass, including analysis and manipulation by humans, provides a place where provenance information could and should be collected and represented. Third, quality control loops provide additional opportunities for provenance collection (and inspection).

The motivation for this project arose from our experiences designing and deploying a solar terrestrial physics virtual observatory system [1, 2] and from numerous discussions with the ‘data’ providers (i.e. roles) in Fig. 1. Among their remarks were the following:
• Data is being used in new ways, and we frequently do not have sufficient information on what happened to the data along the processing stages to determine whether it is suitable for a use we did not envision;
• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows;
• Each time we develop a new instrument, we develop a new data ingest procedure, collect different metadata and organize it differently. The result is then hard to use with previous projects.
Further, when science data and information (often in the form of graphical images, as is the case in our initial deployment) are made available to an end user (any of the roles in Fig. 1), this often happens only after a number of data filtration and processing steps. As a consequence, important metadata and/or documentation that may be needed to answer questions about the provenance may not have been generated, saved or propagated, or may not be in a form or location that can be utilized (at all, or without significant effort or expertise). Virtual Observatories are particularly prone to this information gap. Thus, this project traces the entire pipeline and accounts for all roles, processes and metadata as they relate to use cases that require provenance.

2. USE CASES
Use Case Development: After discussion and several meetings with the science project participants, we developed an initial set of use cases, which reflect a range of actual questions that are asked but at present cannot be answered in any routine or repeatable manner.
• Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January 26, 2005 1849:09UT taken by the ACOS Mark IV polarimeter?
• What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21, 2008, 1900UT missing?
• Why does this image look bad?
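As a rough illustration of how questions such as the first and third use cases might become routine once provenance is recorded at each pipeline stage, the sketch below stores one record per action and answers "who added the comments?" and "find all good images on a date" by filtering those records. This is only a minimal sketch under stated assumptions: the names `ProvenanceRecord`, `who_did` and `artifacts_with_verdict`, the record fields, the artifact identifiers, the agent names and all sample records are hypothetical and invented for illustration. Plain Python structures stand in here for the semantic representation the paper argues for.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ProvenanceRecord:
    """One provenance assertion captured at a pipeline stage (hypothetical schema)."""
    artifact_id: str     # identifier of the data product, e.g. an image file
    stage: str           # pipeline stage that acted on the artifact
    agent: str           # person or program responsible for the action
    action: str          # what was done, e.g. "add_comment" or "quality_check"
    timestamp: datetime  # when the action happened (UT)
    detail: str = ""     # free-text note, e.g. a quality verdict


def who_did(records, artifact_id, action):
    """Return the agents that performed `action` on `artifact_id`, in time order."""
    hits = [r for r in records
            if r.artifact_id == artifact_id and r.action == action]
    return [r.agent for r in sorted(hits, key=lambda r: r.timestamp)]


def artifacts_with_verdict(records, day, verdict):
    """Return ids of artifacts whose latest quality check on `day` gave `verdict`."""
    latest = {}
    # Walk records in time order so later checks overwrite earlier ones.
    for r in sorted(records, key=lambda r: r.timestamp):
        if r.action == "quality_check" and r.timestamp.date() == day:
            latest[r.artifact_id] = r.detail
    return sorted(a for a, v in latest.items() if v == verdict)


# Invented sample records; they do not describe real ACOS or MLSO data.
records = [
    ProvenanceRecord("img-001", "calibration", "calibration_program",
                     "add_comment", datetime(2005, 1, 26, 18, 50)),
    ProvenanceRecord("img-001", "quality_control", "observer_a",
                     "add_comment", datetime(2005, 1, 26, 19, 5)),
    ProvenanceRecord("img-002", "quality_control", "observer_a",
                     "quality_check", datetime(2008, 3, 21, 19, 0), "bad"),
    ProvenanceRecord("img-003", "quality_control", "observer_a",
                     "quality_check", datetime(2008, 3, 21, 19, 10), "bad"),
    ProvenanceRecord("img-002", "quality_control", "observer_b",
                     "quality_check", datetime(2008, 3, 21, 20, 0), "good"),
]

print(who_did(records, "img-001", "add_comment"))
print(artifacts_with_verdict(records, date(2008, 3, 21), "good"))
```

The "why" use cases (missing or bad-looking images) would need more than filtering: they depend on records that link each product to the inputs and processing steps that produced it, which this flat record list deliberately leaves out.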