Processing Analytical Queries in the AWESOME Polystore [Technical Report]

Xiuwen Zheng, Subhasis Dasgupta, Arun Kumar, Amarnath Gupta
University of California, San Diego
xiz675@eng.ucsd.edu, sudasgupta@ucsd.edu, arunkk@eng.ucsd.edu, a1gupta@ucsd.edu

ABSTRACT
Modern big data applications usually involve heterogeneous data sources and analytical functions, leading to increasing demand for polystore systems, especially analytical polystore systems. This paper presents the AWESOME system along with a domain-specific language, ADIL. ADIL is a powerful language that supports 1) native heterogeneous data models such as Corpus, Graph, and Relation; 2) a rich set of analytical functions; and 3) clear and rigorous semantics. AWESOME is an efficient tri-store middleware that 1) is built on top of three heterogeneous DBMSs (Postgres, Solr, and Neo4j) and is easy to extend to incorporate other systems; 2) supports in-memory query engines and is equipped with analytical capability; 3) applies a cost model to efficiently execute workloads written in ADIL; and 4) fully exploits machine resources to improve scalability. A set of experiments on real workloads demonstrates the capability, efficiency, and scalability of AWESOME.

PVLDB Reference Format:
Xiuwen Zheng, Subhasis Dasgupta, Arun Kumar, Amarnath Gupta. Processing Analytical Queries in the AWESOME Polystore [Technical Report]. PVLDB, 14(1): XXX-XXX, 2020. doi:XX.XX/XXX.XX

1 INTRODUCTION
Since their inception in 2015 [12], polystore systems [16, 19, 23, 29] have become a significant area of data management research. In a polystore, a common query processing facility is constructed over a number of data management systems, enabling a user to specify queries across stores.
As polystores were applied to different application domains, it became clear that a polystore system must not only support cross-model queries across data stores, but also support analytical operations, a term we use loosely to refer to operations that perform a computation rather than data manipulation and are typically provided not natively by a DBMS but by external software libraries. Examples of analytical operators include centrality computation on graphs, entity extraction from text, and classification on relational data.

[Figure 1: Illustration of PoliSci workload. Components shown: keywords ("corona", "covid", "pandemic", "vaccine") queried against a News Collection in Solr; Named Entity Recognition over the documents containing keywords; a Join with the US Senator DB in PostgreSQL; and a user query against the Twitter Social Network in Neo4j. Legend: DB query, data flow.]

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:XX.XX/XXX.XX

Many application areas, especially the social and behavioral sciences, have workloads that combine multi-model data queries with analytical operations. [18] presents a workload for the discovery of cyberbullying on social media that uses text classification and network analysis on relational data obtained from Instagram. The EventSKG system [32] uses relational, textual, and graph data to construct an event knowledge graph for public safety incidents by mining social media data. In the domain of intellectual property analytics, [8] constructs an integrated network by combining patent information with technology reports, news, and web data.
[17] collects data from multiple relational and non-relational sources, constructs financial transaction networks, and applies natural language processing and deep learning to detect money laundering activities. Each of these workloads can be viewed as a complex combination of 1) query operations against multiple sources, including graph, text, and relational DBMSs, and 2) analytical functions on intermediate results with different data models.

However, existing polystore systems fail to support this kind of analytical workload. Some polystore systems [6, 21, 28] focus on querying and storing data across different DBMSs without built-in analytical functions; some [4, 28] lack support for graph or text DBMSs; and some [5, 6, 30] use a unified data model without natively supporting heterogeneous data models. Thus, there is an urgent demand for a new polystore system that treats heterogeneous data models as first-class citizens and supports common analytical functions over them.

arXiv:2112.00833v1 [cs.DB] 1 Dec 2021

1.1 System and Language Design Decisions
We first present a motivating workload named PoliSci in detail and use it as a running example to demonstrate our system and language design decisions.

Example 1.1 (PoliSci Workload). As illustrated in Figure 1, given a set of keywords about Covid-19, recent news articles containing any of them are retrieved through text queries against a Solr [1] document database. Then, a named entity recognition operation (an analytical function) is applied to the collected documents to extract named entities (e.g., President Trump). The returned entity list is then joined with a table of Twitter handles for US senators, which is stored in a PostgreSQL [3] relational database, to obtain the Twitter users for named entities who are senators. Finally, the Twitter social