Resource Planning for SPARQL Query Execution on Data Sharing Platforms

Stefan Hagedorn 1, Katja Hose 2, Kai-Uwe Sattler 1, and Jürgen Umbrich 3

1 Ilmenau University of Technology, Germany
2 Aalborg University, Denmark
3 Vienna University of Economics and Business, Austria

Abstract. To increase performance, data sharing platforms often make use of clusters of nodes where certain tasks can be executed in parallel. Resource planning, and in particular deciding how many processors to use for parallel processing, is complex in such a setup because increasing the number of processors does not always improve runtime due to communication overhead. Instead, there is usually an optimum number of processors: using more or fewer leads to longer runtimes. In this paper, we present a cost model based on widely used statistics (VoID) and show how to compute the optimum number of processors for evaluating a particular SPARQL query over a particular configuration and RDF dataset. Our first experiments show the general applicability of our approach but also how shortcomings in the available statistics limit the potential of optimization.

1 Introduction

In recent years, we have witnessed growing interest in and the necessity of enabling large-scale solutions for SPARQL query processing. In particular, the current trend of emerging platforms for sharing (Linked) Open Data aims at enabling convenient access to large numbers of datasets and huge amounts of data. One widely known system is the Datahub 4, a service provided by the Open Knowledge Foundation. Fujitsu published a single point of entry to the Linked Open Data Cloud 5 with the aim of hosting most datasets (as far as their licenses allow) in the same format and in one place. Such a platform can be seen as a catalog component in a dataspace [4,18] whose main purpose is to inventory the available datasets and their meta information.
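The trade-off described above, in which runtime first decreases and then increases with the number of processors, can be illustrated with a toy cost model. The sketch below is purely illustrative and is not the paper's VoID-based model; the work and overhead parameters are hypothetical.

```python
import math

def runtime(p, work, overhead):
    """Estimated runtime for p processors under a simple model:
    the total work is split evenly across processors, while
    communication overhead grows linearly with p."""
    return work / p + overhead * p

def optimal_processors(work, overhead, max_p=64):
    """Return the processor count (1..max_p) that minimizes the
    estimated runtime of the model above."""
    return min(range(1, max_p + 1),
               key=lambda p: runtime(p, work, overhead))

# With work = 100 time units and overhead = 1 unit per processor,
# the analytic optimum of W/p + c*p is p* = sqrt(W/c) = 10:
# fewer processors leave work underparallelized, more processors
# pay too much communication overhead.
p_star = optimal_processors(100.0, 1.0)
```

In this toy setting, `runtime(20, 100.0, 1.0)` is larger than `runtime(10, 100.0, 1.0)`, showing that adding processors beyond the optimum hurts rather than helps.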
However, such platforms do not yet provide processing capabilities over the registered datasets, and it is left to the users to provide additional services such as querying and analytics. Similar to [8], we believe that the next generation of such Linked Data catalogs will provide services to transform, integrate, query, and analyze data across all hosted datasets, considering access policies and restrictions. These services will be able to exploit the fact that such platforms (i) are typically deployed on a distributed (cloud) infrastructure with multiple processors and (ii) host multiple datasets.

Advanced query functionality will be a core service enabling users to explore and analyze the data. However, differing privacy restrictions and licenses pose a major challenge for centralized and distributed query processing approaches. As such, it is very likely that these platforms keep datasets in separate partitions. While the research for

4 http://datahub.io/
5 http://lod4all.net/