International Journal of Computer Applications (0975 – 8887)
Volume 104 – No.7, October 2014

An Efficient and Scalable RDF Indexing Strategy based on B-Hashed-Bitmap Algorithm using CUDA

Sharmi Sankar a, Munesh Singh a, Awny Sayed *, Jihad Alkhalaf Bani-Younis a

a College of Applied Sciences, Ibri, Postal Code 516, Sultanate of Oman
* Faculty of Science, Minia University, Egypt

ABSTRACT
Indexing enormous databases such as RDF stores has been a focus of intense research. Indexing plays a pivotal role in speeding up data retrieval and query performance. Besides expediting search, an index can motivate new data-store schemes and technologies that could reshape the design of large data-analytics engines, particularly those relevant to the Semantic Web. Owing to the proliferation of the internet and the ease of generating data on the fly, handling, storing and subsequently processing that data semantically has proven to be a major bottleneck for the RDF data community. Handling data of such scale requires a parallel approach of the kind provided by GPUs (graphics processing units). In this paper, a new efficient and scalable index is proposed that uses a combination of B+ trees, hashing and sparse matrices. These data structures have an edge over others because they lend themselves to implementation as parallel algorithms using the CUDA (Compute Unified Device Architecture) framework, which is meant for programming massively parallel GPU multicores. So far, RDF data has mostly been stored either in an RDBMS or in a non-native data-store; in both cases, the sequential indexing strategy breaks down as the data-store scales. Parallel implementation of indices provides a suitable option for dealing with scalable and dynamically generated data over distributed networks. The crucial sparse matrix part of the proposed index is benchmarked against different CUDA memory implementations to derive optimal matrix processing options.
The sparse matrix search is profiled using cuda-memcheck and the Visual Profiler to identify bottlenecks and inconsistencies in thread execution, known as thread divergence. The benchmarks show promising results for a B+ tree based index coupled with hashing and sparse matrix implementations.

Keywords
RDF, B+ tree, hashmap, sparse matrix, CUDA, GPU.

1. INTRODUCTION
There are several initiatives to reduce the drawbacks of the current web. One of them is the Semantic Web, a term coined by W3C founder Tim Berners-Lee in a Scientific American article describing the future of the Web [1]. The Semantic Web gives data better structure and computer-understandable meaning, offering a common framework for sharing data across applications, enterprises and communities. It aims to define information on the web in a precise, machine-comprehensible format. The web in its existing incarnation provides information in human-understandable formats, but the meaning of this information and its relation to other pieces of information elsewhere on the web are not well defined. Semantic Web data uses common schemas to describe data from disparate sources. Machines capable of reading this data could comprehend it; for example, inferences could be made about the data based on information from other datasets (Berners-Lee, 2001).

Semantic Web information is often stored in RDF in the form of triples (subject, property, object). A combination of many RDF triples forms an RDF graph. RDF is a metadata model for web resources, which is why it is referred to as the Resource Description Framework (RDF).

A number of storage implementations and schemes have been proposed that use databases to store RDF triples. Some implementations maintain RDF-specific information in the application layer, and some store the RDF schema at the database level.
When the schema is kept at the application level, the application stays database-independent, but at a cost in performance and scalability. When the RDF schema is implemented at the database level, the RDF structure can be exploited to obtain efficiency using existing database models. This review focuses on the existing state of the art in RDF database storage schemes. The simplest way to store RDF data is in a triple store, essentially one large table with three columns for subject, predicate, and object. Variations on the triple store have shown improvements in efficiency and have reduced the number of self-joins needed when issuing complex queries.

RDF storage has witnessed numerous research initiatives in varied domains. Despite the best efforts, a scalable, efficient and fast index has eluded researchers' grasp. A typical RDF data-store consists of billions of triples (a triple comprises a subject, a predicate and an object) with an extensive and wide range of self-dependencies among the subject and object field values. This results in recursive self-joins, at an added cost to the query optimizer [1]. Besides self-joins, unions and null values, it also generates numerous performance-related issues. There are broadly two ways to deal with these issues: either re-design the RDF data-store from scratch, using a new setup for representing the triples along with a modified query engine design, or explore faster and more efficient indexing strategies that keep query processing time low regardless of the size of the data-store.

RDF repositories usually create indexes on one or more components of an RDF triple. Since the volume of data (RDF triples) is quite large, a typical index should be no worse than logarithmic in its time complexity. Many index designs have been suggested, most of them relying on B+ trees and hashing. In one of the suggested designs [10], a forest of B+ trees is created that uses different combinations of S, P and O (subject, predicate and object).
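
The idea of indexing different combinations of S, P and O can be sketched as follows. This is a minimal illustration and not the design from [10]: it assumes all six orderings are indexed, it uses a sorted Python list with binary search as a stand-in for a B+ tree, and the sample triples and the helper names `build_indexes` and `prefix_lookup` are hypothetical.

```python
from bisect import bisect_left
from itertools import permutations

# Hypothetical toy data; the identifiers are illustrative only.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
]


def build_indexes(triples):
    """Build one sorted key list per ordering of (S, P, O).

    Assumption: a sorted list stands in for a B+ tree. Both keep keys
    ordered, so the first matching key is located in O(log n) steps.
    """
    indexes = {}
    for order in permutations(range(3)):  # (0, 1, 2) is SPO, (1, 2, 0) is POS, ...
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in triples)
    return indexes


def prefix_lookup(index, prefix):
    """Return every key in a sorted index that starts with `prefix`."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first candidate match.
    pos = bisect_left(index, prefix)
    matches = []
    while pos < len(index) and index[pos][:len(prefix)] == prefix:
        matches.append(index[pos])
        pos += 1
    return matches


indexes = build_indexes(TRIPLES)
```

With these definitions, `prefix_lookup(indexes["POS"], ("worksAt", "acme"))` returns the two keys whose predicate and object match, without scanning the whole table. Which ordering to consult depends on which components of the triple pattern are bound.
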
The main drawback of this multi-index strategy is that complex queries result in slow data retrieval. Hexastore sextuple