Web-Scale Querying through Linked Data Fragments

Ruben Verborgh, Miel Vander Sande, Pieter Colpaert, Sam Coppens, Erik Mannens, Rik Van de Walle
Multimedia Lab – Ghent University – iMinds
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
{firstname.lastname}@ugent.be

ABSTRACT
To unlock the full potential of Linked Data sources, we need flexible ways to query them. Public SPARQL endpoints aim to fulfill that need, but their availability is notoriously problematic. We therefore introduce Linked Data Fragments, a publishing method that allows efficient offloading of query execution from servers to clients through a lightweight partitioning strategy. It enables servers to maintain availability rates as high as any regular HTTP server, allowing querying to scale reliably to much larger numbers of clients. This paper explains the core concepts behind Linked Data Fragments and experimentally verifies their Web-level scalability, at the cost of increased query times. We show how trading server-side query execution for inexpensive data resources with relevant affordances enables a new generation of intelligent clients.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Linked Data, querying, availability, scalability, SPARQL

1. INTRODUCTION
Whenever there is a large amount of data, people will want to query it, and nothing is more intriguing to query than the vast amounts of Linked Data published over the last few years [2]. With over 800 million triples in the widely used DBpedia [3], only one of the many datasets in a large ecosystem, the need for various specialized information searches has never been higher. SPARQL has been specifically designed [30] to fulfill this requirement for reliable and standardized access to data in the RDF triple format.
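A SPARQL request is typically issued over plain HTTP. As a concrete illustration, the sketch below merely constructs such a request URL without sending it; the endpoint URL and the query are illustrative examples, not part of this paper:

```python
# Sketch of a SPARQL protocol request sent via HTTP GET, shown without
# actually contacting the server. Endpoint and query are example values.
from urllib.parse import urlencode

query = """
SELECT ?person WHERE {
  ?person <http://dbpedia.org/ontology/birthPlace>
          <http://dbpedia.org/resource/Ghent> .
}
"""

endpoint = "http://dbpedia.org/sparql"  # a well-known public endpoint
request_url = endpoint + "?" + urlencode({"query": query})
```

Every such request carries an arbitrary query, so its cost to the server is unbounded in a way that ordinary HTTP resource requests are not.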
Consisting of a query language [15] and a protocol [9], SPARQL is the de facto choice to publish RDF data in a flexible way, and allows us to select with high precision the data that interests us. There is one issue: it appears to be very hard to make a SPARQL endpoint available reliably. A recent survey examining 427 public endpoints concluded that only one third of them have an availability rate above 99%; not even half of all endpoints reach 95% [6]. To put this into perspective: 95% availability means the server is unavailable for one and a half days every month. These figures are quite disturbing given that availability is usually measured in "number of nines" [5, 25], counting the number of leading nines in the availability percentage. In comparison, the fairly common three nines (99.9%) amounts to 8.8 hours of downtime per year. The disappointingly low availability of public SPARQL endpoints is the Semantic Web community's very own "Inconvenient Truth". More precisely, practice reveals that the following three characteristics are irreconcilable for SPARQL endpoints: a) being publicly available; b) offering unrestricted queries; c) having many concurrent users. This is because the load of a server is proportional to the product of the variety, complexity, and number of requests, the first two of which remain virtually unbounded for SPARQL.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
LDOW2014 Seoul, Korea
Copyright 2014 ACM ...$15.00.
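The downtime figures above follow directly from the availability percentages; a quick sketch of the arithmetic, assuming a 30-day month and a 365-day year:

```python
# Downtime implied by a given availability rate.
# Assumes a 30-day month and a 365-day year for simplicity.

def monthly_downtime_days(availability, days_per_month=30):
    """Days of downtime per month at the given availability rate."""
    return (1 - availability) * days_per_month

def yearly_downtime_hours(availability, days_per_year=365):
    """Hours of downtime per year at the given availability rate."""
    return (1 - availability) * days_per_year * 24

# 95% availability: one and a half days of downtime every month.
print(round(monthly_downtime_days(0.95), 2))   # 1.5
# "Three nines" (99.9%): roughly 8.8 hours of downtime per year.
print(round(yearly_downtime_hours(0.999), 2))  # 8.76
```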
Any endpoint's availability can be considerably improved by sacrificing one of these three characteristics: private SPARQL endpoints perform well because the server load can be predicted more reliably; limiting query possibilities eliminates slow queries that can bring down the server; and low demand of course contributes positively to availability. HTTP servers, on the other hand, have no problem combining these characteristics, as the complexity of each request can be limited: the server restricts what "queries" a client can execute by determining the offered HTTP resources [13]. We do not claim by any means that this comparison is fair, as SPARQL servers have to perform significantly more work per request. On the contrary, it is exactly because SPARQL requests require more processing that SPARQL endpoints do not scale as well as HTTP servers.

This paper challenges the idea that servers should spend their CPU cycles on expensive queries, and proposes a model in which the client solves a complex query by asking the server only for simple data retrieval operations. Instead of answering a complex SPARQL query, the server sends a Linked Data Fragment that corresponds to a specific triple pattern. This fragment contains metadata that allows the client itself to execute the complex query. While this leads to an increased number of HTTP requests between clients and servers, each request is easy to answer and fully cacheable. Therefore, this is a scalable and sustainable approach to Web querying: with SPARQL, each new client requires additional processing power from the server, whereas with Linked Data Fragments, clients take care of their own processing. We effectively trade fast answers but low scalability for increased (yet manageable) query times with Web-level scalability. Most importantly, this makes it possible to fully query datasets of publishers who cannot invest in hosting and maintaining an expensive SPARQL endpoint, which is most of us.
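The client-side execution model described above can be illustrated with a toy example. The dataset, the fetch_fragment helper, and the join logic below are hypothetical simplifications for exposition, not the actual Linked Data Fragments interface (which also exposes paging and count metadata):

```python
# Toy sketch of client-side query execution over triple pattern fragments.
# The data, fetch_fragment(), and the nested join are illustrative only.

DATASET = [  # (subject, predicate, object) triples held by the "server"
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob",   "livesIn", "Ghent"),
    ("carol", "livesIn", "Seoul"),
]

def fetch_fragment(s=None, p=None, o=None):
    """Server side: return all triples matching one pattern (None = variable).
    Each such request is trivial to answer and easy to cache."""
    return [t for t in DATASET
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def friends_living_in(person, city):
    """Client side: answer '?x where person knows ?x and ?x livesIn city'
    by combining two simple fragment requests locally."""
    candidates = fetch_fragment(s=person, p="knows")                  # request 1
    residents = {t[0] for t in fetch_fragment(p="livesIn", o=city)}   # request 2
    return [o for (_, _, o) in candidates if o in residents]

print(friends_living_in("alice", "Ghent"))  # ['bob']
```

The server only ever evaluates single triple patterns; the join, the expensive part of the query, runs on the client, which is precisely the trade-off between query time and scalability that the paper examines.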