An Entity Based RDF Indexing Schema Using Hadoop And HBase

Fateme Abiri
Dept. of Computer Engineering, Ferdowsi University, Mashhad, Iran
Abiri.fateme@stu.um.ac.ir

Mohsen Kahani
Dept. of Computer Engineering, Ferdowsi University, Mashhad, Iran
Kahani@ferdowsi.um.ac.ir

Fatane Zarinkalam
Dept. of Computer Engineering, Ferdowsi University, Mashhad, Iran
Fattane.zarrinkalam@stu.um.ac.ir

Abstract: Recent developments in the Semantic Web have opened new research on designing search engines that organize and manage semantic data. The core of a search engine is its indexing system, which consists of two main parts: data storage and data retrieval. As the amount of semantic data grows, the most important requirement for an indexing system is the ability to store large amounts of data and retrieve them as quickly as possible. In other words, a scalable indexing system is one of the major challenges in semantic search engines. In this paper, a scalable method for indexing RDF data is presented that uses HBase, a NoSQL database management system, as its underlying data storage. HBase provides random access to massive data on top of the distributed Hadoop framework, which makes it a suitable option for managing such data. Further, because entity based queries are important and popular, a new schema based on a clustering algorithm is designed to respond effectively to this type of query. The experimental evaluation shows that the proposed indexing system is effective in improving the scalability and retrieval of RDF data.

Keywords: RDF Indexing; Agglomerative Clustering Algorithm; Entity Based Queries; NoSQL Database; HBase; Hadoop

I. INTRODUCTION

The Semantic Web is an extension of the traditional web. In the traditional web, user requests are expressed simply as keywords, and search engines retrieve the indexed documents in which those keywords occur [1].
The Semantic Web, by contrast, was introduced to enable search engines to respond to complex user requests based on their meaning. The relevant information sources therefore have to be structured semantically. To address this, the W3C (World Wide Web Consortium, www.w3.org) has introduced a framework that describes information resources in a semantic structure: the Resource Description Framework, or RDF for short. RDF documents consist of subject-predicate-object expressions and can be interpreted as a graph in which the subjects and objects are nodes and the predicates are edges. This simple knowledge representation model is also machine-readable, so automated software agents can exchange knowledge distributed across the Internet. To respond to complex queries, the data must be organized effectively. Investigating different methods of organizing semantic data is therefore an open challenge that interests developers.

The process of organizing data is called data indexing. Each indexing system consists of two main components: data storage and data retrieval. The most important requirement for such a system is the ability to store large amounts of data and retrieve them as quickly as possible. To achieve this goal, it is necessary to design a schema that can index data scalably and respond to complex user requests within an acceptable time.

User requests are expressed in the SPARQL language. In a SPARQL query, users extract subgraphs of entities from RDF graphs. As argued in [2], five types of SPARQL queries can be applied to RDF graphs: single triple queries, star shaped queries, entity based queries, path based queries, and graph based queries. According to recent statistics [3], the majority of user queries look for an entity with specific attributes.
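As an illustration only, and not part of the proposed system, the triple model and two of the query types above can be sketched in Python. The sample triples, the ex: prefix, and the helper names match_triple and entity_query are hypothetical. An entity based query is read here, as an assumption, as a star shaped pattern whose predicate-object pairs all share one subject.

```python
from collections import defaultdict

# Hypothetical sample data: each RDF statement is a (subject, predicate, object) tuple.
TRIPLES = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:worksAt", "ex:um"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:um", "rdf:type", "ex:University"),
]


def match_triple(triples, s=None, p=None, o=None):
    """Single triple query: None plays the role of a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def entity_query(triples, constraints):
    """Entity based query (assumed reading): return the subjects that
    carry every (predicate, object) pair listed in constraints."""
    attributes = defaultdict(set)
    for s, p, o in triples:
        attributes[s].add((p, o))
    wanted = set(constraints)
    return sorted(s for s, attrs in attributes.items() if wanted <= attrs)


# Single triple pattern: ?s ex:name ?o
print(match_triple(TRIPLES, p="ex:name"))
# Entity pattern: ?s rdf:type ex:Person . ?s ex:worksAt ex:um
print(entity_query(TRIPLES, [("rdf:type", "ex:Person"), ("ex:worksAt", "ex:um")]))
```

This sketch scans all triples in memory for every query. An index exists precisely to avoid such a scan on large data sets.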
The aim of the proposed method is therefore to respond effectively to star shaped queries with one path, that is, entity based queries, and to single triple queries.

The rest of this paper is organized as follows. Related works are reviewed in Section II. Section III describes the proposed system in detail. Section IV presents the experimental evaluation of the system, and the paper ends with the conclusion and future work.

II. RELATED WORKS

Two key factors should be considered when designing an RDF management system. The first is the indexing schema, and the second is the underlying systems used to index and manage RDF data. Both factors are explained below.

A. RDF Indexing Schema

A major challenge of the Semantic Web is storing data and then processing queries to retrieve that data within an acceptable time. To address this challenge, many indexing methods have been introduced. They can be divided into two groups: schemas that interpret RDF documents as a graph, and schemas that interpret RDF documents as a set of triples. The first group of schemas, discussed in [2,4,5,6,7,8], analyzes the structure of the RDF graph and extracts the relationships between nodes. The metadata obtained from the analyzed graph is then indexed together with the triples. Because these methods rely on metadata of the RDF graph, query processing is complex.

th International Conference on Computer and Knowledge Engineering (ICCKE), IEEE

The purpose of these methods is