Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools

Mohammad Farhan Husain, Latifur Khan, Murat Kantarcioglu and Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
800 West Campbell Road, Richardson, TX 75080-3021
mfh062000@utdallas.edu, lkhan@utdallas.edu, muratk@utdallas.edu, bhavani.thuraisingham@utdallas.edu

Abstract—Cloud computing is the newest paradigm in the IT world and hence the focus of new research. Companies hosting cloud computing services face the challenge of handling data intensive applications. Semantic web technologies, which have been standardized by the World Wide Web Consortium (W3C), are an ideal candidate to use together with cloud computing tools to provide a solution. One such standard is the Resource Description Framework (RDF). With the explosion of semantic web technologies, large RDF graphs are commonplace, and current frameworks do not scale to them. In this paper, we describe a framework that we built using Hadoop, a popular open source framework for cloud computing, to store and retrieve large numbers of RDF triples. We describe a scheme to store RDF data in the Hadoop Distributed File System. We present an algorithm that generates the best possible query plan for a SPARQL Protocol and RDF Query Language (SPARQL) query based on a cost model, and we use Hadoop's MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity class hardware. Furthermore, we show that our framework is scalable and efficient and can easily handle billions of RDF triples, unlike traditional approaches.

Keywords-RDF; Hadoop; Cloud; Semantic Web;

I. INTRODUCTION

Cloud computing is now the center of attraction for large enterprises looking for ways to be more cost efficient. A lot of research is going on in this arena to make cloud computing more efficient, secure and affordable.
The semantic web is an evolving technology which can be utilized for this purpose. Semantic Web technologies are being developed to present data in a more efficient way so that such data can be retrieved and understood by both human and machine. At present, web pages are published as plain HTML files, which are not suitable for reasoning; a machine treats these HTML files as a bag of keywords. Researchers are developing Semantic Web technologies, which have been standardized to address such inadequacies. The most prominent standards are the Resource Description Framework (RDF, http://www.w3.org/TR/rdf-primer) and the SPARQL Protocol and RDF Query Language (SPARQL, http://www.w3.org/TR/rdf-sparql-query). RDF is the standard for storing and representing data, and SPARQL is a query language to retrieve data from an RDF store. The power of these Semantic Web technologies can be successfully harnessed in a cloud computing environment to give users the capability to efficiently store and retrieve data for data intensive applications. Synergy between the semantic web and cloud computing fields offers great benefits, such as standards for data representation across frameworks.

The need for cloud computing hosting companies to handle data intensive applications scalably is a major issue. Even with huge amounts of data, data intensive systems should not be bogged down and their performance should not deteriorate. Designing such scalable systems is not a trivial task. When it comes to semantic web data such as RDF, we face similar challenges. With storage becoming cheaper and the need to store and retrieve large amounts of data growing, developing systems to handle trillions of RDF triples requiring tera- or petabytes of disk space is no longer a distant prospect. Researchers are already working on billions of triples [16], [19]. Competitions are being organized to encourage researchers to build efficient repositories (http://challenge.semanticweb.org). At present, there are just a few frameworks (e.g.
Jena (http://jena.sourceforge.net), Sesame (http://www.openrdf.org), BigOWLIM (http://www.ontotext.com/owlim/big/index.html)) for Semantic Web technologies, and these frameworks do not scale to large RDF graphs. The Jena Semantic Web Framework is one of the most popular; it offers several models to store data: an in-memory model, the SDB model (http://jena.hpl.hp.com/wiki/SDB), the TDB model (http://jena.hpl.hp.com/wiki/TDB), etc. They are all designed for a single-machine scenario; hence, they are not scalable when it comes to terabytes of data. A Jena in-memory model running on a machine with 2 GB of main memory can process only 10 million triples. Another such framework
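To make the data model these frameworks manage concrete, the following sketch stores a handful of RDF triples and answers a simple SPARQL-style triple pattern. The example data and URIs are made up for illustration; real stores such as Jena index the triples rather than scanning a list.

```python
# Minimal illustrative RDF store: each triple is (subject, predicate, object).
# Subjects/predicates are URIs; objects may be URIs or literals.
triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/bob"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob"),
]

def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None plays the role of a SPARQL variable."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogue of: SELECT ?s WHERE { ?s foaf:name "Alice" }
result = match(triples, p="http://xmlns.com/foaf/0.1/name", o="Alice")
print([s for s, _, _ in result])  # ['http://example.org/alice']
```

A SPARQL query with several triple patterns is answered by joining the results of such pattern matches on their shared variables, which is the step the MapReduce-based query plans described in this paper distribute across a cluster.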