QFed: Query Set For Federated SPARQL Query Benchmark

Nur Aini Rakhmawati
INSIGHT Centre, National University of Ireland, Galway
nur.aini@deri.org

Muhammad Saleem
Universität Leipzig, IFI/AKSW, PO 100920, D-04009 Leipzig
saleem@informatik.uni-leipzig.de

Sarasi Lalithsena
Kno.e.sis Center, Wright State University, Dayton, OH, USA
sarasi@knoesis.org

Stefan Decker
INSIGHT Centre, National University of Ireland, Galway
stefan.decker@deri.org

ABSTRACT
The increasing attention paid to federated SPARQL query systems emphasizes the need for benchmarks that evaluate their performance. Most existing benchmarks rely on a set of predefined static queries over a particular set of data sources. Such benchmarks are useful for comparing general purpose SPARQL query federation systems such as FedX and SPLENDID. However, special purpose federation systems such as TopFed and SAFE cannot be tested with these static benchmarks, since these systems operate only on specific data sets and the corresponding queries. To facilitate benchmarking of such special purpose SPARQL query federation systems, we propose QFed, a dynamic SPARQL query set generator that takes into account the characteristics of both the data set and the queries, along with the cost of data communication. Our experimental results show that QFed can successfully generate a large set of meaningful federated SPARQL queries for the performance evaluation of different federated SPARQL query engines.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures

General Terms
Theory

Keywords
Linked Data, Data Integration, Federation SPARQL Query

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
iiWAS '14 Hanoi, Vietnam
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

1. INTRODUCTION
Over the last years, the Web of Data has grown significantly and currently comprises over 2000 data sources¹ from diverse domains. Consequently, SPARQL query federation approaches [15, 5, 1, 17, 11] have gained significant importance as a means of collecting data from these globally distributed Linked Data sources. General purpose SPARQL query federation systems (e.g., [15, 5, 1]) have been actively used for SPARQL query federation over multiple SPARQL endpoints (Linked Data storage servers). Besides these, specialised federation systems that are optimized for a specific use case or data set (e.g., TopFed² [9, 10] and SAFE³) have gained considerable attention as well. FedBench [13, 14], on the other hand, has been developed to compare and position such systems. While FedBench is useful for testing general purpose SPARQL query federation systems, it cannot be used for the performance evaluation of specialized federation systems. This is because the benchmark consists of predefined queries and a set (possibly a singleton) of data sources, which may be completely irrelevant to the specialized federation system at hand.
For example, TopFed [9, 10] is a specialized federation system designed specifically for the Linked Cancer Genome Atlas (LTCGA⁴) [10] data set. However, to the best of our knowledge, none of the existing SPARQL query federation benchmarks includes the LTCGA data set and queries. Similarly, SAFE was developed for SPARQL query federation over Linked statistical data using the RDF DataCube vocabulary, and thus it is essential to test this system with RDF DataCube data sets and corresponding SPARQL queries. In a nutshell, to test such specialized systems, we need a dynamic query generator that can generate a variety of federated SPARQL queries over a given set of data sources.

¹ LOD Stats: http://stats.lod2.eu/
² TopFed federation engine: https://code.google.com/p/topfed/
³ SAFE federation engine: http://linked2safety.hcls.deri.org:8080/SAFE-Demo/
⁴ Linked TCGA: http://tcga.deri.ie/

To this end, we propose QFed, a dynamic query set generator for federated SPARQL query benchmarks that takes into account key characteristics of both the data set and the SPARQL queries which have a direct impact on the perfor-