Real-Time Collaborative Analysis with (Almost) Pure SQL: A Case Study in Biogeochemical Oceanography

Daniel Halperin, University of Washington, dhalperi@cs.uw.edu
Konstantin Weitz, University of Washington, weitzkon@cs.uw.edu
Bill Howe, University of Washington, billhowe@cs.uw.edu
Francois Ribalet, University of Washington, ribalet@uw.edu
Mak A. Saito, Woods Hole Oceanographic Institute, msaito@whoi.edu
E. Virginia Armbrust, University of Washington, armbrust@uw.edu

ABSTRACT

We consider a case study using SQL-as-a-Service to support “instant analysis” of weakly structured relational data at a multi-investigator science retreat. Here, “weakly structured” means tabular, rows-and-columns datasets that share some common context, but that have limited a priori agreement on file formats, relationships, types, schemas, metadata, or semantics. In this case study, the data were acquired from hundreds of distinct locations during a multi-day oceanographic cruise using a variety of physical, biological, and chemical sensors and assays. Months after the cruise, when preliminary data processing was complete, 40+ researchers from a variety of disciplines participated in a two-day “data synthesis workshop.” At this workshop, two computer scientists used a web-based query-as-a-service platform called SQLShare to perform “SQL stenography”: capturing the scientific discussion in real time to integrate data, test hypotheses, and populate visualizations that then informed and enhanced further discussion. In this “field test” of our technology and approach, we found that it was not only feasible to support interactive science Q&A with essentially pure SQL, but that we significantly increased the value of the “face time” at the meeting: researchers from different fields were able to validate assumptions and resolve ambiguity about each other's fields. As a result, new science emerged from a meeting that was originally just a planning meeting.
In this paper, we describe the details of this experiment, discuss our major findings, and lay out a new research agenda for collaborative science database services.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Scientific Databases

General Terms
Design, Experimentation, Human Factors, Management

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SSDBM '13, July 29–31 2013, Baltimore, MD, USA
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00

1. INTRODUCTION

Data analysis is replacing data acquisition as the bottleneck to scientific discovery. The challenges associated with high-volume data have received significant attention [10], but the challenges related to integrating weakly structured, high-variety data (hundreds of datasets with hundreds of columns and no a priori agreement on format or semantics) are understudied. Even at small scales, our collaborators report that these situations require them to spend up to 90% of their time on data handling tasks that have little to do with the science [5].

We posit that the use of declarative query languages can significantly reduce the overhead of working with weakly structured relational data, allowing real-time, discussion-oriented scientific Q&A as opposed to relying on offline programming.
To test this hypothesis, we have designed and deployed a web-based query-as-a-service system called SQLShare [5]¹ that emphasizes a simple Upload/Query/Share workflow over heavyweight database engineering and administration tasks. Data can be uploaded to SQLShare “as is” and queried directly; a basic schema is inferred from the column headers and data types. Queries can be saved as views and shared with colleagues by exchanging URLs. In prior work, we found that this approach can capture most relevant tasks and improve productivity for distributed, asynchronous collaboration [6].

In this paper, we consider whether our query-as-a-service approach can also be used to improve productivity in real-time, synchronous, face-to-face collaboration, even without assuming that the data have been integrated into some pre-engineered schema. The challenges are significant: data must be cleaned and integrated, and science questions must be disambiguated and encoded in SQL, all on the fly.

When successful, this level of interactivity for scientific Q&A is not just faster, it is a different experience. The availability of instant results to questions arising from organic discussion changes the nature of the meeting: instead of assigning action items for investigators to complete offline when the “trail is cold”, the researchers can test hypotheses and explore the implications online, during the meeting, while the ideas are fresh and everyone's perspective can be incorporated (“data-driven discussion”).

We test this approach in the context of the GeoMICS project [1], a multi-institution, multi-disciplinary oceanographic collaboration between geochemists and molecular ecologists spearheaded by co-author Armbrust. The team acquired data during a research cruise in May 2012 in the northeast Pacific Ocean. The overall purpose of the cruise was two-fold. The scientific goal of the cruise was to study a well-defined transition zone between coastal and open-ocean waters [9].
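As a rough illustration of the Upload/Query/Share workflow described above, the following Python sketch uses the standard-library sqlite3 module as a stand-in; it is not SQLShare's implementation. The table name ctd_casts, its columns and sample rows, the view name surface_temperature, and the helpers infer_type and upload are all hypothetical and chosen only for this example. Sharing by URL is not modeled; the saved view stands in for the shareable artifact.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Guess a SQLite column type from the string values of one column."""
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in values:
                cast(v)
            return sql_type
        except ValueError:
            continue  # at least one value does not parse; try the next type
    return "TEXT"


def upload(conn, table, csv_text):
    """Load CSV text "as is": column names come from the header row and
    column types are inferred from the data (assumes at least one data row)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    types = [infer_type(col) for col in zip(*data)]
    col_defs = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return dict(zip(header, types))


# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = """station,depth_m,temperature_c
P1,5,12.5
P1,50,9.8
P4,5,13.1
"""

conn = sqlite3.connect(":memory:")

# Upload: no schema is declared up front.
schema = upload(conn, "ctd_casts", SAMPLE_CSV)

# Share: save a query as a named view that others can query or build on.
conn.execute(
    "CREATE VIEW surface_temperature AS "
    "SELECT station, temperature_c FROM ctd_casts WHERE depth_m <= 10"
)

# Query: the view is used like any other table.
surface = conn.execute(
    "SELECT station, temperature_c FROM surface_temperature ORDER BY station"
).fetchall()

print(schema)
print(surface)
```

Because the view is itself just a named query, a colleague can layer further views on top of it without copying or reshaping the underlying data, which is the property the workflow relies on.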
¹ https://sqlshare.escience.washington.edu/

To