Towards a Query Language for Unified Scientific Data Processing and Management

AUTHORS
Javad Chamanara (javad.chamanara@uni-jena.de)
Heinz Nixdorf Endowed Chair for Distributed Information Systems
Ernst-Abbe-Platz 2, 07743 Jena, Germany, +49 (0)3641-946444

Birgitta König-Ries (birgitta.koenig-ries@uni-jena.de)
Heinz Nixdorf Endowed Chair for Distributed Information Systems
Ernst-Abbe-Platz 2, 07743 Jena, Germany, +49 (0)3641-946430

KEYWORDS
Scientific data processing, data lifecycle management, scientific query language

TOPIC/ISSUE
Ecological data warehousing, including data archival, retrieval, sharing, mining and visualization

BODY
Motivation. Today's data processing tools cope only insufficiently with the current and growing requirements of many scientific disciplines [6]. Scientists and their observation, processing, and analysis tools produce huge amounts of data in a variety of formats. These data may be schema-less, semi-structured, or fully structured, and persist in different repositories. With the paradigm shift from computational science to data-driven discovery [10], the need for full-fledged data lifecycle management utilities has emerged [6, 10, 15]. But what should a full-fledged data lifecycle management tool look like? In this abstract, we attempt to answer this question. We identify requirements and features of a universal data lifecycle management tool and contrast them with what existing tools such as Google Refine [1], LINQ [9], Matlab [3], R [13], VisualDB [14], WQL [12], HTSQL [2], UnQL [5], Jaql [7], Kepler [11], VisTrails [17], Taverna [16], OPM [4] and others offer. Finally, we propose the SciQL (Scientific Query Language) toolset to overcome the limitations of existing approaches.

Requirements. Scientists of various disciplines need to manage and process their data without being confronted with too much IT complexity. Data is typically processed in a series of steps, performed by possibly different tools that require different data formats and types.
Thus, users frequently need to reformat or transform their data. They also often want to reuse previously prepared data in calculations, analyses, or visualizations. Filtering, joining, and clustering data to feed computational procedures are likewise common. During the process, it is necessary to serialize intermediate or final results to persistent storage for later use, which brings up versioning and provenance management requirements.
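The typical multi-step workflow described above can be illustrated with a minimal sketch in plain Python. All names here (the toy datasets, the filter_rows/join/persist helpers, the provenance fields) are hypothetical and do not represent SciQL itself; the sketch merely shows the kind of filter, join, and versioned serialization steps a unified tool would need to support.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Two small toy datasets standing in for data held in different repositories.
observations = [
    {"site": "A", "species": "oak",   "count": 12},
    {"site": "B", "species": "birch", "count": 3},
    {"site": "A", "species": "birch", "count": 7},
]
sites = [{"site": "A", "region": "north"}, {"site": "B", "region": "south"}]

def filter_rows(rows, predicate):
    """Filtering step: keep only the rows satisfying the predicate."""
    return [r for r in rows if predicate(r)]

def join(left, right, key):
    """Join step: combine rows from two datasets on a shared key."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def persist(rows, path, step_name, version):
    """Serialize an intermediate result with version and provenance metadata,
    so it can be reused and traced in later analysis steps."""
    record = {
        "version": version,
        "provenance": {
            "step": step_name,
            "created": datetime.now(timezone.utc).isoformat(),
        },
        "data": rows,
    }
    with open(path, "w") as f:
        json.dump(record, f)
    return record

# Pipeline: filter -> join -> persist the intermediate result for later reuse.
frequent = filter_rows(observations, lambda r: r["count"] >= 5)
joined = join(frequent, sites, key="site")
out_path = os.path.join(tempfile.gettempdir(), "step2_joined.json")
record = persist(joined, out_path, step_name="filter+join", version=1)
```

Even in this toy form, each step consumes the output of the previous one and the persisted record carries enough metadata (version number, step name, timestamp) to support the versioning and provenance requirements noted above.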