Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 365–373
Invited paper
Scalable reduction of large datasets to interesting subsets
Gregory Todd Williams∗, Jesse Weaver, Medha Atre, James A. Hendler
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA
Article info
Article history:
Received 2 February 2010
Received in revised form 11 May 2010
Accepted 10 August 2010
Available online 19 August 2010
Keywords:
Billion Triples Challenge
Scalability
Parallel
Inferencing
Query
Triplestore
Abstract
With a huge amount of RDF data available on the web, the ability to find and access relevant information
is crucial. Traditional approaches to storing, querying, and reasoning fall short when faced with web-scale
data. We present a system that combines the computational power of large clusters for enabling large-
scale reasoning and data access with an efficient data structure for storing and querying the accessed
data on a traditional personal computer or other resource-constrained device. We present results of
using this system to load the 2009 Billion Triples Challenge dataset, materialize RDFS inferences, extract
an “interesting” subset of the data using a large cluster, and further analyze the extracted data using a
personal computer, all in the order of tens of minutes.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
The semantic web is growing rapidly with vast amounts of
RDF data being published on the web. RDF data is now available
from many sources across the web relating to a huge variety of
topics. Examples of these RDF datasets include the Billion Triples Challenge¹ dataset (collected by a web crawler from RDF documents available on the web), the Linked Open Data project² (in which a number of independent datasets are linked together using common URIs), and the recent conversions of governmental datasets to RDF.³
With so much structured data available on the web, it is impor-
tant to be able to find, extract and use just the data that is relevant
to a particular problem in a reasonable amount of time. Several
factors make this difficult, including different modeling of simi-
lar data across datasets, the lack of tools to perform reasoning on
large datasets, and the challenge of interacting with and analyzing
large datasets in realtime. Extracting specific subsets of interest
from very large datasets can alleviate the challenge of analysis, but
can also be a time-consuming task despite often being simplistic in nature. Many tools require lengthy loading and indexing times, which cannot be amortized when used to support a small number of queries meant to extract a subset of data.

∗ Corresponding author at: Department of Computer Science, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA. Tel.: +1 518 276 4431; fax: +1 518 276 4464.
E-mail addresses: willig4@cs.rpi.edu (G.T. Williams), weavej3@cs.rpi.edu (J. Weaver), atrem@cs.rpi.edu (M. Atre), hendler@cs.rpi.edu (J.A. Hendler).
1 http://challenge.semanticweb.org/, accessed 2010/02/01.
2 http://linkeddata.org/, accessed 2010/02/01.
3 http://data-gov.tw.rpi.edu/ and http://data.gov.uk/ being the most prominent, accessed 2010/02/01.
We propose a composition of approaches that addresses these issues. In particular, we present a system that uses a large cluster/supercomputer to materialize RDFS inferences over large datasets and to efficiently extract specific subsets of the data. We also make use of a compressed triplestore that can directly answer interactive queries on a wide range of (non-cluster) systems. We evaluate this system on the Billion Triples Challenge 2009 (BTC2009) dataset.⁴
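To illustrate the kind of RDFS materialization involved, the sketch below applies two RDFS entailment rules, rdfs9 (type propagation through subclassing) and rdfs11 (transitivity of rdfs:subClassOf), to an in-memory triple set until a fixpoint is reached. This is a minimal single-machine illustration of forward chaining, not the parallel implementation described in this paper, and the prefixed names (`ex:` terms) are hypothetical.

```python
# Minimal forward-chaining sketch of RDFS materialization (rules rdfs9 and
# rdfs11 only). Triples are (subject, predicate, object) tuples of strings.
RDF_TYPE = "rdf:type"
RDFS_SUBCLASSOF = "rdfs:subClassOf"

def materialize_rdfs(triples):
    """Return the closure of `triples` under rdfs9 and rdfs11."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        # All (subclass, superclass) pairs currently known.
        sub = {(s, o) for s, p, o in closure if p == RDFS_SUBCLASSOF}
        new = set()
        for s, p, o in closure:
            if p == RDF_TYPE:
                # rdfs9: (x type C), (C subClassOf D) => (x type D)
                for c, d in sub:
                    if o == c:
                        new.add((s, RDF_TYPE, d))
            elif p == RDFS_SUBCLASSOF:
                # rdfs11: (C subClassOf D), (D subClassOf E) => (C subClassOf E)
                for c, d in sub:
                    if o == c:
                        new.add((s, RDFS_SUBCLASSOF, d))
        if not new <= closure:
            closure |= new
            changed = True
    return closure
```

For example, given the (hypothetical) triples ("ex:alice", rdf:type, "ex:Student") and ("ex:Student", rdfs:subClassOf, "ex:Person"), the closure additionally contains ("ex:alice", rdf:type, "ex:Person"). The parallel system in this paper distributes this rule application across cluster nodes rather than iterating over a single in-memory set.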
The rest of the paper is organized as follows. Section 2 reviews
related work. Section 3 presents a use case that motivates the work
presented in the rest of this paper. Section 4 describes the core
components of our system, and in Section 5 we evaluate our sys-
tem by performing inferencing on the dataset, extracting relevant
data, and loading the reduced dataset into a compressed triplestore.
Finally, Section 6 concludes the paper and discusses work we hope
to pursue in the future.
2. Related work
Our system consists of three components: parallel inferencing,
parallel query, and a compressed triplestore. In this section we
review work related to each of these components.
4 http://vmlion25.deri.ie/index.html, accessed 2010/02/01.
1570-8268/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.websem.2010.08.002