SAMUEL: A Sharing-based Approach to Processing Multiple SPARQL Queries with MapReduce ∗

InA Kim
Dep't of Computer Engineering, Chungnam National University
Daejeon, 34134, Korea
dodary0214@gmail.com

Kyong-Ha Lee 2
Korea Institute of Science and Technology Information
Daejeon, 34141, Korea
bart7449@gmail.com

Kyu-Chul Lee
Dep't of Computer Engineering, Chungnam National University
Daejeon, 34134, Korea
kclee@cnu.ac.kr

ABSTRACT

The volume of RDF data is now growing tremendously. It is thus considered prudent to store and process massive RDF data with distributed SPARQL engines instead of relying on a single-machine system. Many sophisticated index and partitioning schemes have also been proposed to support SPARQL query evaluation. However, existing SPARQL engines have mainly followed a one-at-a-time scheme, in which query evaluation focuses only on processing each query separately. We showcase SAMUEL, a distributed SPARQL engine that simultaneously evaluates many SPARQL queries over a massive RDF dataset with MapReduce. SAMUEL provides an efficient optimization algorithm to evaluate many SPARQL queries simultaneously in a shared and balanced way. Extensive experiments show that, without any sophisticated partitioning or index mechanisms, our approach significantly outperforms other MapReduce-based SPARQL engines as well as an ad-hoc query engine equipped with various indexes and partitioning tools when evaluating multiple SPARQL queries.

1 INTRODUCTION

The Resource Description Framework (RDF) is a versatile graph data model that enables users to express facts and their relationships in the form of triples ⟨Subject, Predicate, Object⟩, where a predicate (P) expresses a relationship between a subject (S) and an object (O) [2].
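To make the triple model concrete, the following minimal Python sketch represents triples as tuples and matches triple patterns against them. The toy data, the `match` and `scan_once` helpers, and the `?` prefix for variables are illustrative assumptions, not SAMUEL's implementation; `scan_once` only hints at how the distinct patterns of several queries could be served by a single pass over the data.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]    # (subject, predicate, object)
Pattern = Tuple[str, str, str]   # terms starting with "?" are variables

# Hypothetical toy data, not taken from the paper.
TRIPLES: List[Triple] = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]


def match(pattern: Pattern, triple: Triple) -> Optional[Dict[str, str]]:
    """Return variable bindings if the triple matches the pattern, else None."""
    bindings: Dict[str, str] = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A repeated variable must bind to the same value each time.
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def scan_once(
    patterns: List[Pattern], triples: List[Triple]
) -> Dict[Pattern, List[Dict[str, str]]]:
    """Match every distinct pattern in a single pass over the triples."""
    results: Dict[Pattern, List[Dict[str, str]]] = {p: [] for p in set(patterns)}
    for triple in triples:
        for pattern, found in results.items():
            bindings = match(pattern, triple)
            if bindings is not None:
                found.append(bindings)
    return results


# Two hypothetical queries that share the pattern (?x, knows, ?y).
q1 = [("?x", "knows", "?y"), ("?x", "worksAt", "?z")]
q2 = [("?x", "knows", "?y")]
shared = scan_once(q1 + q2, TRIPLES)
```

In this sketch the pattern `(?x, knows, ?y)` appears in both hypothetical queries but is evaluated only once per triple, since duplicates are removed before the scan.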
Many knowledge bases are now defined in the RDF format, e.g., DBpedia [16], Bio2RDF [6] and UniProt [7]. They form a very large set of graphs with millions of triples interlinked with each other and are queried using the SPARQL query language [1]. It has been a major challenge to find the subgraph patterns described in given SPARQL queries in a massive set of RDF graphs while supporting both efficiency and scalability.

To address the issue, numerous SPARQL query engines have been devised based on MapReduce [15] or their proprietary distributed architectures [3]. Meanwhile, multiple SPARQL queries often need to be evaluated together, as the queries can be prepared before runtime in some scenarios [13, 20]. Motivated by these facts and our previous work on XML data [8], we devise a MapReduce-based SPARQL engine called SAMUEL, which simultaneously evaluates many SPARQL queries in a shared and balanced way. The major features of SAMUEL are as follows:

Support of parallel multi-SPARQL query processing. SAMUEL provides an efficient means to process a massive set of RDF data in parallel. It does not require any sophisticated partitioning or index mechanisms for SPARQL query evaluation. Nonetheless, SAMUEL easily outperforms other MapReduce-based distributed SPARQL engines by simultaneously evaluating multiple SPARQL queries with a short series of MapReduce jobs.

∗ This work is equally supported by KISTI and NRF grant (No. NRF-2015R1A2A2A01004879) funded by the Korea government.
2 Corresponding author
© 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018, ISBN 978-3-89318-078-3 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Sharing input scans and intermediate results. SAMUEL enables computing nodes to share input scans and their intermediate results with each other.
While joining RDF triples to evaluate queries, the join operations assigned to each reducer share the input and intermediate results associated with the distinct subquery patterns that multiple queries have in common. Consequently, SAMUEL saves many I/Os by removing redundant subquery matching during query evaluation.

Runtime load balancing and multi-query optimization. In a distributed system, a straggling task delays overall job execution. This problem worsens with MapReduce, since MapReduce typically enforces barrier synchronization between Map and Reduce tasks. MapReduce's native runtime scheduling algorithm has also proved to be inefficient, especially in the reduce stage [12, 15]. To address the issue, SAMUEL instead exploits a dynamic shuffling scheme that balances workloads across reducers at each MapReduce job in accordance with the cardinality of the RDF triples that each reducer consumes for joining. It decomposes a given set of SPARQL queries into distinct triple patterns and then gradually builds a bushy query plan tree that covers all RDF join operations required to evaluate the given SPARQL queries. Since processing the join operations at each level of the query plan tree requires a single MapReduce (M/R) job, SAMUEL always tries to build a bushy plan tree with the lowest possible height. It then partitions the join operations at each level into n groups and assigns them to n reducers. To balance workloads across the reducers that actually perform the join operations, SAMUEL computes the cost of each join operation at each level of the plan tree before the actual joining. It then assigns join operations to reducers at each level such that every reducer has the same overall join cost and the communication cost is as low as possible, avoiding as far as possible the transfer of redundant RDF triples to multiple reducers.

The rest of this paper is organized as follows. Section 2 introduces previous studies directly related to our work.
Section 3 describes how we evaluate multiple SPARQL queries together and presents our system architecture, which provides workload balancing as well as multi-SPARQL query processing. Section 4 presents our demonstration scenario, including the major results of our performance evaluation.

Demonstration. Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2018.80

2 RELATED WORK

Numerous distributed SPARQL engines have been devised so far to store RDF data and to evaluate SPARQL queries [3, 9, 11, 14, 19, 21, 22]. Readers are referred to a recent survey on distributed SPARQL query engines [3]. These systems fall into