Efficient evaluation of SPARQL over RDFS graphs

Jorge Pérez, Marcelo Arenas
Departamento de Ciencia de la Computación
Pontificia Universidad Católica de Chile

Claudio Gutierrez
Departamento de Ciencia de la Computación
Universidad de Chile

ABSTRACT

Evaluating queries over RDFS is currently done by computing the closure of the graph, whose size is quadratic in the worst case. In this paper, we show how to evaluate SPARQL over RDFS graphs by using a navigational language grounded in standard relational algebra, with operators for navigating triples under regular constraints. In particular, we show that there is a fragment of this language with linear-time data complexity that is sufficient to perform RDFS evaluation of standard SPARQL queries.

1. INTRODUCTION

The Resource Description Framework (RDF) [6] is a graph data model for representing information about World Wide Web resources. Every piece of information in RDF is described as a subject-predicate-object triple that represents a relationship between two resources. Furthermore, the RDF specification includes a set of reserved words, the RDFS vocabulary, that has a predefined semantics.

Evaluating queries over RDF graphs in a way that takes into account the predefined semantics of the RDFS vocabulary is challenging. Current approaches (e.g. [4, 2]) implement the following procedure for evaluating a query Q over a data source G with RDFS vocabulary: compute the closure of G, that is, make explicit all the implicit information contained in G by adding to it all the triples that are semantic consequences of G, and then evaluate Q over this extended graph.

From a practical point of view, the above approach has several drawbacks. First, it was shown in [8] that the size of the closure of G is quadratic in the worst case, making the computation and storage of the closure too expensive for web-scale applications.
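As a rough illustration of the closure-based procedure and of the quadratic growth reported in [8], the following Python sketch computes the closure of a graph under a single rule, the transitivity of rdfs:subClassOf. This is a deliberately simplified assumption: the full RDFS closure involves further rules, and the names `subclass_closure` and `chain` are hypothetical helpers introduced here, not part of any system cited in this paper.

```python
from itertools import product

SC = "rdfs:subClassOf"


def subclass_closure(graph):
    """Naive fixpoint over one rule (an illustrative simplification):
    if (a, SC, b) and (b, SC, c) are present, add (a, SC, c)."""
    closure = set(graph)
    changed = True
    while changed:
        changed = False
        # Snapshot of the current subclass edges as (subject, object) pairs.
        sc_edges = [(s, o) for (s, p, o) in closure if p == SC]
        for (a, b), (c, d) in product(sc_edges, sc_edges):
            if b == c and (a, SC, d) not in closure:
                closure.add((a, SC, d))
                changed = True
    return closure


def chain(n):
    """Hypothetical worst-case input: C0 subClassOf C1 subClassOf ... C(n-1)."""
    return {(f"C{i}", SC, f"C{i + 1}") for i in range(n - 1)}


if __name__ == "__main__":
    g = chain(20)
    closed = subclass_closure(g)
    print(len(g), len(closed))  # prints "19 190"
```

On a chain of n classes the input has n - 1 triples, while the closure under this single rule already has n(n - 1)/2 triples, which is the kind of blow-up that makes materializing the closure costly.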
Second, once the closure has been computed, all the queries are evaluated over a graph that can be much larger than the original graph. This can be particularly bad for queries that scan a large part of the input graph. Third, the approach is not goal-oriented: although a query may mention only a small part of the RDFS vocabulary, the whole vocabulary is used in computing the closure.

It was shown in [8] that one can determine whether a triple is implied by an RDFS graph by checking the existence of certain paths over the graph. For every triple, such a path can be specified using regular expressions plus some additional features. For example, to check whether a resource A represents a subclass of a resource B, that is, whether the triple (A, rdfs:subClassOf, B) is a semantic consequence of a graph data source, it is enough to check whether there exists a path from A to B in the graph in which every edge is labeled rdfs:subClassOf. Then, to determine whether some resource is a subclass of another, we can use the expression (rdfs:subClassOf)+.

Driven by this motivation, we introduce a language for navigating RDF data. The formal design of the language is grounded in standard relational algebra plus Kleene star, with operators for navigating triples subject to constraints. It is an algebraic language of binary relations, with the operations of composition, inverse, union, and Kleene star, plus three operators that navigate between any two entries of a triple with a constraint on the third entry that is expressible in the same language. We also isolate a fragment of this language that captures the deductive rules of RDFS, thus obtaining a sublanguage that is sufficient to obtain the RDFS evaluation of SPARQL [9] queries and has linear-time data complexity.

2. NAVIGATIONAL RDF LANGUAGE

The basic notion of the language is that of a navigational expression, which is defined by the following grammar:

    exp := self | self::a (a ∈ U) | axis | axis::a (a ∈ U) | axis::[exp] | exp/exp | exp|exp | exp*

where axis ∈ {next, next⁻¹, edge, edge⁻¹, node, node⁻¹}, and U represents the universe of URIs.

The language allows different forms of navigation in RDF graphs. The most natural axis is next::a, which will be interpreted as the a-neighbor relation in G, that is, the pairs of nodes (x, y) such that (x, a, y) ∈ G. Given that in the RDF data model a node can also be the label of an edge, the language also allows navigation from a node to one of its leaving edges by using the edge axis. More formally, the interpretation of edge::a will be the pairs of nodes (x, y) such that (x, y, a) ∈ G. There are six navigational axes, which