Real-Time Collaborative Analysis with (Almost) Pure SQL: A Case Study in Biogeochemical Oceanography

Daniel Halperin, University of Washington, dhalperi@cs.uw.edu
Konstantin Weitz, University of Washington, weitzkon@cs.uw.edu
Bill Howe, University of Washington, billhowe@cs.uw.edu
Francois Ribalet, University of Washington, ribalet@uw.edu
Mak A. Saito, Woods Hole Oceanographic Institute, msaito@whoi.edu
E. Virginia Armbrust, University of Washington, armbrust@uw.edu

ABSTRACT

We consider a case study using SQL-as-a-Service to support “instant analysis” of weakly structured relational data at a multi-investigator science retreat. Here, “weakly structured” means tabular, rows-and-columns datasets that share some common context, but that have limited a priori agreement on file formats, relationships, types, schemas, metadata, or semantics. In this case study, the data were acquired from hundreds of distinct locations during a multi-day oceanographic cruise using a variety of physical, biological, and chemical sensors and assays. Months after the cruise, when preliminary data processing was complete, 40+ researchers from a variety of disciplines participated in a two-day “data synthesis workshop.” At this workshop, two computer scientists used a web-based query-as-a-service platform called SQLShare to perform “SQL stenography”: capturing the scientific discussion in real time to integrate data, test hypotheses, and populate visualizations to then inform and enhance further discussion. In this “field test” of our technology and approach, we found that it was not only feasible to support interactive science Q&A with essentially pure SQL, but that we significantly increased the value of the “face time” at the meeting: researchers from different fields were able to validate assumptions and resolve ambiguity about each other’s fields. As a result, new science emerged from a meeting that was originally just a planning meeting.
In this paper, we describe the details of this experiment, discuss our major findings, and lay out a new research agenda for collaborative science database services.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Scientific Databases

General Terms
Design, Experimentation, Human Factors, Management

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SSDBM ’13, July 29–31 2013, Baltimore, MD, USA
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00

1. INTRODUCTION

Data analysis is replacing data acquisition as the bottleneck to scientific discovery. The challenges associated with high-volume data have received significant attention [10], but the challenges related to integrating weakly structured, high-variety data (hundreds of datasets with hundreds of columns and no a priori agreement on format or semantics) are understudied. Even at small scales, our collaborators report that these situations require them to spend up to 90% of their time on data handling tasks that have little to do with the science [5].

We posit that the use of declarative query languages can significantly reduce the overhead of working with weakly structured relational data, allowing real-time, discussion-oriented scientific Q&A as opposed to relying on offline programming.
To test this hypothesis, we have designed and deployed a web-based query-as-a-service system called SQLShare [5]¹ that emphasizes a simple Upload/Query/Share workflow over heavyweight database engineering and administration tasks. Data can be uploaded to SQLShare “as is” and queried directly; a basic schema is inferred from the column headers and data types. Queries can be saved as views and shared with colleagues by exchanging URLs. In prior work, we found that this approach can capture most relevant tasks and improve productivity for distributed, asynchronous collaboration [6].

In this paper, we consider whether our query-as-a-service approach can also be used to improve productivity in real-time, synchronous, face-to-face collaboration, even without assuming that the data has been integrated into some pre-engineered schema. The challenges are significant: data must be cleaned and integrated, and science questions must be disambiguated and encoded in SQL, all on-the-fly.

When successful, this level of interactivity for scientific Q&A is not just faster, it is a different experience. The availability of instant results to questions arising from organic discussion changes the nature of the meeting: instead of assigning action items for investigators to complete offline when the “trail is cold”, the researchers can test hypotheses and explore the implications online, during the meeting, while the ideas are fresh and everyone’s perspective can be incorporated: “data-driven discussion.”

We test this approach in the context of the GeoMICS project [1], a multi-institution, multi-disciplinary oceanographic collaboration between geochemists and molecular ecologists spearheaded by co-author Armbrust. The team acquired data during a research cruise in May 2012 in the northeast Pacific Ocean. The overall purpose of the cruise was two-fold. The scientific goal of the cruise was to study a well-defined transition zone between coastal and open-ocean waters [9].
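As a rough illustration of the Upload/Query/Share workflow described earlier in this section, the following sketch loads two small delimited files as is, infers a basic schema from the column headers and values, and saves a query as a view. It is not SQLShare's implementation: it uses Python's built-in sqlite3 module in place of the hosted service, and the helper names (infer_type, upload), the table and column names, and the sample values are all hypothetical.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Guess a SQLite column type from string samples (None = missing)."""
    present = [v for v in values if v is not None]
    if not present:
        return "TEXT"
    for caster, sql_type in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in present:
                caster(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def upload(conn, table, text):
    """Load delimited text 'as is': header row gives names, values give types."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    header = rows[0]
    # Treat empty cells as missing values.
    data = [[v.strip() or None for v in r] for r in rows[1:]]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)
    return dict(zip(header, types))


# Hypothetical uploads; none of these values come from the cruise data.
stations_csv = "station,lat,lon\nS1,48.0,-125.5\nS2,48.6,-126.7\n"
samples_csv = "station,depth_m,nitrate\nS1,5,1.2\nS1,50,7.9\nS2,5,0.4\n"

conn = sqlite3.connect(":memory:")
stations_schema = upload(conn, "stations", stations_csv)
samples_schema = upload(conn, "samples", samples_csv)

# "Query" and "Share": save an integration query as a named view.
conn.execute(
    """
    CREATE VIEW surface_nitrate AS
    SELECT s.station, s.lat, s.lon, m.nitrate
    FROM stations AS s
    JOIN samples AS m ON m.station = s.station
    WHERE m.depth_m <= 10
    """
)

result = conn.execute(
    "SELECT station, nitrate FROM surface_nitrate ORDER BY station"
).fetchall()
print(stations_schema, samples_schema, result)
```

In this sketch the view name stands in for the shareable URL: anyone with access to the database can build further queries on surface_nitrate without repeating the join.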
¹ https://sqlshare.escience.washington.edu/

To