Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 365–373

Invited paper

Scalable reduction of large datasets to interesting subsets

Gregory Todd Williams, Jesse Weaver, Medha Atre, James A. Hendler
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA

Article history: Received 2 February 2010; Received in revised form 11 May 2010; Accepted 10 August 2010; Available online 19 August 2010

Keywords: Billion Triples Challenge; Scalability; Parallel; Inferencing; Query; Triplestore

Abstract

With a huge amount of RDF data available on the web, the ability to find and access relevant information is crucial. Traditional approaches to storing, querying, and reasoning fall short when faced with web-scale data. We present a system that combines the computational power of large clusters, which enables large-scale reasoning and data access, with an efficient data structure for storing and querying the accessed data on a traditional personal computer or other resource-constrained device. We present results of using this system to load the 2009 Billion Triples Challenge dataset, materialize RDFS inferences, extract an "interesting" subset of the data using a large cluster, and further analyze the extracted data using a personal computer, all on the order of tens of minutes. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

The semantic web is growing rapidly, with vast amounts of RDF data being published on the web. RDF data is now available from many sources and covers a huge variety of topics.
Examples of these RDF datasets include the Billion Triples Challenge dataset¹ (collected by a web crawler from RDF documents available on the web), the Linked Open Data project² (in which a number of independent datasets are linked together using common URIs), and the recent conversions of governmental datasets to RDF.³

With so much structured data available on the web, it is important to be able to find, extract, and use just the data that is relevant to a particular problem in a reasonable amount of time. Several factors make this difficult, including the different modeling of similar data across datasets, the lack of tools for reasoning over large datasets, and the challenge of interacting with and analyzing large datasets in real time. Extracting specific subsets of interest from very large datasets can alleviate the challenge of analysis, but it can also be a time-consuming task despite often being simplistic in nature. Many tools require lengthy loading and indexing times, which cannot be amortized when used to support only a small number of queries meant to extract a subset of data.

We propose a composition of approaches that addresses these issues. In particular, we present a system that uses a large cluster/supercomputer to materialize RDFS inferences over large datasets and to efficiently extract specific data subsets.

Corresponding author at: Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA. Tel.: +1 518 276 4431; fax: +1 518 276 4464.
E-mail addresses: willig4@cs.rpi.edu (G.T. Williams), weavej3@cs.rpi.edu (J. Weaver), atrem@cs.rpi.edu (M. Atre), hendler@cs.rpi.edu (J.A. Hendler).
¹ http://challenge.semanticweb.org/, accessed 2010/02/01.
² http://linkeddata.org/, accessed 2010/02/01.
³ http://data-gov.tw.rpi.edu/ and http://data.gov.uk/ being the most prominent, accessed 2010/02/01.
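To make the materialization step concrete, the following is a minimal single-machine sketch of RDFS forward chaining, applying only two of the RDFS entailment rules (rdfs7 for rdfs:subPropertyOf and rdfs9 for rdfs:subClassOf) to a fixpoint. The triple representation and all names here are illustrative assumptions, not the paper's actual cluster implementation:

```python
# Toy RDFS materializer: repeatedly apply rules rdfs7 and rdfs9
# until no new triples can be derived. Triples are plain 3-tuples
# of strings; a real system would use a distributed representation.

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"

def materialize(triples):
    """Apply rdfs7/rdfs9 to a fixpoint and return the closure."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(c1, c2) for c1, p, c2 in closure if p == SUBCLASS}
        subprop = {(p1, p2) for p1, p, p2 in closure if p == SUBPROP}
        new = set()
        for s, p, o in closure:
            if p == RDF_TYPE:
                # rdfs9: (x rdf:type C1), (C1 subClassOf C2) => (x rdf:type C2)
                for c1, c2 in subclass:
                    if o == c1:
                        new.add((s, RDF_TYPE, c2))
            # rdfs7: (x P1 y), (P1 subPropertyOf P2) => (x P2 y)
            for p1, p2 in subprop:
                if p == p1:
                    new.add((s, p2, o))
        if not new <= closure:
            closure |= new
            changed = True
    return closure

data = {
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
}
print(("ex:alice", RDF_TYPE, "ex:Person") in materialize(data))  # True
```

Because the rules are iterated to a fixpoint, type memberships propagate through chains of subclass axioms without needing a separate transitivity rule for this small example.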
We also make use of a compressed triplestore that can directly answer interactive queries on a wide range of (non-cluster) systems. We evaluate this system on the Billion Triples Challenge 2009 (BTC2009) dataset.⁴

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents a use case that motivates the work presented in the rest of this paper. Section 4 describes the core components of our system, and in Section 5 we evaluate our system by performing inferencing on the dataset, extracting relevant data, and loading the reduced dataset into a compressed triplestore. Finally, Section 6 concludes the paper and discusses work we hope to pursue in the future.

2. Related work

Our system consists of three components: parallel inferencing, parallel query, and a compressed triplestore. In this section we review work related to each of these components.

⁴ http://vmlion25.deri.ie/index.html, accessed 2010/02/01.

1570-8268/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.websem.2010.08.002
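As background for the compressed triplestore component, the sketch below shows dictionary encoding, a common compression technique in RDF stores: each distinct term is mapped to a small integer ID, so triples become fixed-width integer tuples that are cheap to store and compare. The class and method names are hypothetical and do not correspond to the paper's actual store format:

```python
# Toy dictionary-encoded triplestore: terms are interned as integer
# IDs, and triple patterns are matched in ID space before decoding.

class DictEncodedStore:
    """Minimal triplestore that dictionary-encodes terms as integer IDs."""

    def __init__(self):
        self.term_to_id = {}   # term string -> integer ID
        self.id_to_term = []   # integer ID -> term string
        self.triples = set()   # (sid, pid, oid) integer triples

    def _encode(self, term):
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def add(self, s, p, o):
        self.triples.add((self._encode(s), self._encode(p), self._encode(o)))

    def match(self, s=None, p=None, o=None):
        """Yield decoded triples matching a pattern; None is a wildcard."""
        pattern = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)                # wildcard position
            elif term in self.term_to_id:
                pattern.append(self.term_to_id[term])
            else:
                return                              # unknown term: no matches
        for triple in self.triples:
            if all(q is None or q == v for q, v in zip(pattern, triple)):
                yield tuple(self.id_to_term[i] for i in triple)

store = DictEncodedStore()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:bob", "foaf:knows", "ex:carol")
print(list(store.match(p="foaf:knows", o="ex:bob")))
# [('ex:alice', 'foaf:knows', 'ex:bob')]
```

Production compressed stores go further (sorted ID indexes, bit-packed adjacency structures), but the space savings already come from never repeating long URI strings in the triple table.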