ISSC 2012, NUI Maynooth, June 28-29

Runtime Characterisation of Triple Stores: An Initial Investigation

Long Cheng 1,2, Spyros Kotoulas 2, Tomas Ward 1, Georgios Theodoropoulos 2

1 Department of Electronic Engineering, National University of Ireland, Maynooth
email: {lcheng,tomas.ward}@eeng.nuim.ie

2 IBM Research Ireland
email: {spyros.kotoulas,geortheo}@ie.ibm.com

_______________________________________________________________________________

Abstract – The Semantic Web is considered a data integration system for different content and applications, in which every item has a specified meaning that machines can understand and process without human intervention. Triple stores are the backbone of this “web of data”, allowing storage and retrieval of semi-structured data as linked data, usually formatted as RDF. Although there have been attempts at benchmarking the response times and query throughput of individual triple stores, there has been no systematic study of the impact of query implementation on performance. In this work, we analyze the general querying process of popular triple stores and construct and measure a set of core metrics. Using three queries chosen from LUBM, we run experiments on a standard server and report detailed experimental results. These results will be useful in designing future distributed systems and optimizing architectures for RDF data processing.

Keywords – Triple Store System, RDF, LUBM

_______________________________________________________________________________

I INTRODUCTION

Since the advent of Linked Data, the so-called Semantic Web has been moving into the mainstream. It offers characteristics that the traditional web lacks, such as amenability to machine processing, information lookup and knowledge inference. It is increasingly prevalent, particularly among governments and enterprises, which see it as a more flexible way to represent their data.
Another notable example is the availability of several datasets from multiple domains as Linked Data, such as general knowledge (DBpedia), bioinformatics (Uniprot), GIS (geonames, linkedgeodata), and web-page annotations (schema.org, RDFa, microformats). In tandem with the increasing availability of such data and the corresponding technologies, an increasing number of software platforms now use RDF (e.g. the BBC website).

This web is built on the W3C’s Resource Description Framework (RDF) [1], which describes the Semantic Web data model in the form of subject-predicate-object (SPO) expressions, each a statement about resources and their relationships; these expressions are known as RDF triples. As an example, contact information about a person named ‘Jack’ can be represented using the following triples:

(Jack, e-mail, jack@123.com)
(Jack, mobile, 01891234567)
(Jack, address, Dublin)

These triples convey the information that Jack has the e-mail address jack@123.com, has the mobile number 01891234567 and has Dublin as an address.

SPARQL is an SQL-like query language used to express queries on databases that store data as RDF. It incorporates conjunction, disjunction and optional patterns. For example, the query with the following three triple patterns

{ ?x e-mail ?y . ?x mobile ?z . ?x address Dublin }

describes a basic graph pattern (BGP) that can be used to find the e-mail address and mobile number of all people who live in Dublin.

A Triple Store System (TSS) is used to store data according to the RDF data model and optionally provides the ability to infer implicit triples. Generally, these TSS can be divided into three types [2]: native stores, which have a database engine optimized for RDF processing; DBMS-backed stores, which represent the RDF model in a relational schema backed by a relational DBMS; and hybrid stores, which support both architectures. There are many popular TSS implementations available, e.g.
Jena [3], Sesame [4], RDF-3X [5] and Virtuoso [6], and the engineering of TSS is currently very active.

Along with the growth in new TSS implementations, there has been a corresponding increase in interest in relevant performance evaluations. Liu [7] evaluated 7 RDF storage systems by comparing data loading and query response times over datasets of different sizes generated from the Lehigh University Benchmark (LUBM). Rohloff [8] implemented the queries and datasets from LUBM to compare the per-