RDF Aggregate Queries and Views

Edward Hung, Yu Deng, V.S. Subrahmanian
Department of Computer Science
University of Maryland, College Park, MD 20742
{ehung,yuzi,vs}@cs.umd.edu

Abstract

Resource Description Framework (RDF) is a rapidly expanding web standard. RDF databases attempt to track the massive amounts of web data and services available. In this paper, we study the problem of aggregate queries. We develop an algorithm to compute answers to aggregate queries over RDF databases and algorithms to maintain views involving those aggregates. Though RDF data can be stored in a standard relational DBMS (and hence we can execute standard relational aggregate queries and view maintenance methods on them), we show experimentally that our algorithms, which operate directly on the RDF representation, exhibit significantly superior performance.

1 Introduction

Resource Description Framework (RDF) is a W3C recommendation endorsed by approximately 300 companies. Though RDF has many complex features, the basic idea is to describe (resource, property, value) triples specifying that a given resource has a given value for the described property. RDF databases are expected to store such triples about web pages and other information resources so that users can query the web using a sophisticated, database-style query language rather than the simple keyword search supported by most current web search engines.

In this paper, we propose the CAA (Compute Aggregates Algorithm) to efficiently compute aggregate operations such as COUNT, SUM, AVG, MIN and MAX. CAA can also handle GROUPBY queries. We subsequently define algorithms to maintain aggregate views, that is, views involving aggregate queries. We split aggregate functions into two categories: distributive and non-distributive aggregates. We provide algorithms (called AMI and AMD) to maintain aggregate views when insertions and deletions are made.
In addition, we provide methods to maintain aggregate views when triples are modified (called AMT) and when resources are modified (called AMR).

We also note that RDF databases can be easily stored in relational form. As a consequence, standard algorithms to maintain aggregate relational views can be implemented to maintain RDF views. We have implemented this strategy and compared it to our implementation of AMI, AMD, AMR and AMT; our algorithms are much faster than performing view maintenance on the relational version.

The organization of this paper is as follows. We first introduce the basics of RDF and RDF aggregates in Section 2. In Section 3, we describe how to extend a commercial RDF query language called RDQL (proposed by Hewlett Packard) to support aggregations, and propose the CAA algorithm to compute aggregates. Section 4 proposes the AMI, AMD, AMT and AMR algorithms. The relational approach is described in Section 5. Section 6 describes our prototype implementation of these algorithms. The results show that, when the database is updated, our incremental maintenance algorithms run 10 to 1000 times faster than a complete recomputation and about 1.8 to 1109 times faster than the relational implementation. We discuss related work in Section 7.

2 Preliminaries and Motivating Examples

2.1 RDF Model

RDF's main goal is to express information about the values of properties of resources. As a consequence, RDF statements express what we call (resource, property, value) triples.¹ Each resource is expressed via a Uniform Resource Identifier (URI), which looks very similar to a URL. Note that the value of a property can be another resource. Figure 1(b) shows a sample RDF instance. It states, for example, that there is a resource at http://www.artist.net#guyrose, which has a property called fname whose value is "Guy".

¹ The resource, property and value are also often referred to as subject, predicate and object, respectively.
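To make the triple model concrete, the following is a minimal Python sketch of a small in-memory triple store with a grouped aggregate over it. It is an illustration only, not the CAA or AMI algorithm of this paper. The names `Triple`, `values_of`, `sum_count_by`, `apply_insert` and `avg` are hypothetical, and every triple except the `fname` triple for `guyrose` is made-up sample data (including the `creator` and `price` properties). The sketch also shows the standard property that makes SUM and COUNT distributive: a new value can be folded into the stored result without rescanning the data, and AVG can then be derived from the two.

```python
from typing import Dict, List, NamedTuple, Tuple, Union

Value = Union[str, int, float]


class Triple(NamedTuple):
    resource: str  # subject: a URI
    prop: str      # predicate
    value: Value   # object: a literal or the URI of another resource


ARTIST = "http://www.artist.net#"

# Only the first triple is taken from the text; the rest are hypothetical.
triples: List[Triple] = [
    Triple(ARTIST + "guyrose", "fname", "Guy"),
    Triple(ARTIST + "painting1", "creator", ARTIST + "guyrose"),
    Triple(ARTIST + "painting1", "price", 300),
    Triple(ARTIST + "painting2", "creator", ARTIST + "guyrose"),
    Triple(ARTIST + "painting2", "price", 500),
    Triple(ARTIST + "painting3", "creator", ARTIST + "other"),
    Triple(ARTIST + "painting3", "price", 200),
]


def values_of(data: List[Triple], prop: str) -> Dict[str, List[Value]]:
    """Map each resource to the list of its values for the given property."""
    out: Dict[str, List[Value]] = {}
    for t in data:
        if t.prop == prop:
            out.setdefault(t.resource, []).append(t.value)
    return out


def sum_count_by(data: List[Triple], group_prop: str,
                 value_prop: str) -> Dict[Value, Tuple[float, int]]:
    """Group resources by group_prop and keep (sum, count) of value_prop."""
    groups = values_of(data, group_prop)
    vals = values_of(data, value_prop)
    state: Dict[Value, Tuple[float, int]] = {}
    for resource, keys in groups.items():
        for key in keys:
            for v in vals.get(resource, []):
                s, c = state.get(key, (0, 0))
                state[key] = (s + v, c + 1)
    return state


def apply_insert(state: Dict[Value, Tuple[float, int]],
                 key: Value, value: float) -> None:
    """Fold one newly inserted value into a group's (sum, count) in place."""
    s, c = state.get(key, (0, 0))
    state[key] = (s + value, c + 1)


def avg(state: Dict[Value, Tuple[float, int]], key: Value) -> float:
    """Derive AVG from the stored SUM and COUNT of a group."""
    s, c = state[key]
    return s / c


view = sum_count_by(triples, "creator", "price")
guy = ARTIST + "guyrose"
print(view[guy], avg(view, guy))  # (800, 2) 400.0

# A new painting by the same creator updates the view without a rescan.
apply_insert(view, guy, 100)
print(view[guy], avg(view, guy))  # (900, 3) 300.0
```

MIN and MAX can be updated the same way on insertion, but a deletion that removes the current minimum or maximum may force another look at the remaining values, which is one reason view maintenance treats insertions and deletions separately.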