Exploration in Web Science: Instruments for Web Observatories

Marie Joan Kristine Gloria
Rensselaer Polytechnic Institute
Troy, NY
glorim@rpi.edu

Deborah L. McGuinness
Rensselaer Polytechnic Institute
Troy, NY
dlm@cs.rpi.edu

Joanne S. Luciano
Rensselaer Polytechnic Institute
Troy, NY
jluciano@rpi.edu

Qingpeng Zhang
Rensselaer Polytechnic Institute
Troy, NY
zhangq6@rpi.edu

ABSTRACT
This contribution highlights selected work conducted by Rensselaer Polytechnic Institute’s Web Science Research Center (RPI WSRC). Specifically, it describes four themed Web Observatories: Science Data, Health and Life Sciences, Open Government, and Social Spaces. Each observatory serves as a repository of data, tools, and methods that help answer complicated questions in its research area. We present six case studies featuring tools and methods developed by RPI WSRC to aid in the exploration, discovery, and analysis of large data sets. These case studies, along with our Web Observatory developments, aim to increase our understanding of web science in general and to serve as test beds for our research.

Categories and Subject Descriptors
E.0 Data General

Keywords
Web Observatory, Linked Data, Methods, Semantic technologies

1. INTRODUCTION
As the Web matures, academics and researchers agree on the need to create, deploy, enable, and foster mechanisms and tools for its exploration and sustainability. The goal of a Web Observatory is to mobilize a research community that leverages the strengths of multiple disciplines, methodologies, and theoretical frameworks. At Rensselaer Polytechnic Institute’s Tetherless World Constellation Web Science Research Center (RPI WSRC), our work addresses multiple facets of this goal, including the Web’s infrastructure, transdisciplinary data exploration and analysis, visualization, and social networks.
As such, our observatories present both tools and methodologies that empower researchers to study the Web and to make a difference in the world. The RPI WSRC has four central themed observatories: science data, health and life sciences, open government, and social spaces. These four themes reflect a growing interest in each research area, where the collection and analysis of data have scaled significantly thanks to the Web. Our observatories include tools, collaborative processes, and methods that enable researchers to answer critical and complicated questions. To illustrate this, we present several case studies from each of these observatory themes. First, we introduce our most comprehensive observatory, the Science Data Observatory. Here we briefly discuss the SemantEco and Semantic Water Quality portal projects as exemplars, although we have built numerous science observatories. Second, the health and life sciences observatory includes our work on the Health and Human Services (HHS) Data Challenge. This effort is one of a number of health efforts, and it highlights a set of tools developed in-house that enabled the discovery of, access to, and integration of HHS’s datasets. More importantly, our contribution exposes the power and efficiency of a semantics-enriched toolkit and process. Third, we turn to our work in the open government space, which we demonstrate with our International Open Government Data Set (IOGDS). The IOGDS is a linked data application based on metadata "scraped" from hundreds of international dataset catalog websites that publish a rich variety of government data. Lastly, the RPI WSRC is developing tools and methods to explore social spaces with the First Responder’s Portal and the Twitter Network Observatory. Both enable the exploration of relationships and semantics in graph databases.
In sharing our work, we hope to showcase how we can use the Web as a tool to study real-world events; how semantics-enriched tools ease exploration within these Web Observatories; and how we can now examine emerging communities on the Web.

2. SCIENCE DATA OBSERVATORY
The Semantic Ecology and Environmental Portal (SemantEco) facilitates collaborative work across multiple disciplines by providing support tools to help manage, analyze, visualize, and present large, complex ecosystems. This semantically enabled environmental monitoring framework uses a family of ontologies, some of which are domain-independent and aimed at facilitating monitoring. These include, for example, the notion of a pollution event, which occurs when a contaminant measurement falls outside its appropriate range. SemantEco provides an OWL-based reasoning scheme and provenance-based facet generation to support query answering and data validation over the integrated data via SPARQL [2]. Specific topic portals then add domain-dependent ontologies. For example, the Semantic Water Quality Portal (SemantAqua) includes ontologies concerning contaminants relevant to water pollution and the relevant regulations. SemantAqua uses description logic reasoning along with its ontologies to detect water pollution and facilities that violate water regulations. It preserves provenance metadata using the Proof Markup Language (PML) [7]. It is thus capable of providing detailed information about where the information came from and what inferences were made. It also integrates relevant health information so that it can link high contaminant levels to the health effects of exposure.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil. ACM 978-1-4503-2038-2/13/05.
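The notion of a pollution event from Section 2 (a contaminant measurement that falls outside its appropriate range) can be illustrated with a minimal Python sketch. This is not SemantEco's implementation: plain Python stands in for the OWL reasoning and SPARQL queries the portal uses, and every name here (`Measurement`, `ALLOWED_RANGES`, `detect_pollution_events`, `sites_in_violation`) as well as every contaminant label, site label, and numeric range is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Measurement:
    """One contaminant reading at a monitoring site (hypothetical schema)."""
    site: str
    contaminant: str
    value: float  # assumed to be in the same unit as the matching range


# Hypothetical allowed ranges (low, high) per contaminant.
# These numbers are illustrative only and are not real regulatory limits.
ALLOWED_RANGES: Dict[str, Tuple[float, float]] = {
    "contaminant_a": (0.0, 0.5),
    "contaminant_b": (6.5, 8.5),
}


def detect_pollution_events(
    measurements: Iterable[Measurement],
    allowed_ranges: Dict[str, Tuple[float, float]],
) -> List[Measurement]:
    """Return the measurements whose value lies outside the allowed range."""
    events = []
    for m in measurements:
        limits = allowed_ranges.get(m.contaminant)
        if limits is None:
            continue  # no range known for this contaminant, so nothing to flag
        low, high = limits
        if not (low <= m.value <= high):
            events.append(m)
    return events


def sites_in_violation(events: Iterable[Measurement]) -> List[str]:
    """Return the sorted, de-duplicated sites that have at least one event."""
    return sorted({e.site for e in events})


# Made-up sample readings used only to exercise the functions above.
SAMPLE = [
    Measurement("site_1", "contaminant_a", 0.2),    # within range
    Measurement("site_2", "contaminant_a", 0.9),    # above the upper bound
    Measurement("site_3", "contaminant_b", 5.9),    # below the lower bound
    Measurement("site_1", "contaminant_c", 100.0),  # no range defined, skipped
]

if __name__ == "__main__":
    for event in detect_pollution_events(SAMPLE, ALLOWED_RANGES):
        print(event.site, event.contaminant, event.value)
```

In an ontology-based system such as the one described above, the range check would be expressed as class definitions that a reasoner evaluates, and each flagged event would carry provenance about its source data. The sketch only shows the underlying comparison.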