Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores

Harris T. Lin and Vasant Honavar
Department of Computer Science, Iowa State University, Ames, IA 50011 USA
{htlin,honavar}@iastate.edu

Abstract: The emergence of many interlinked, physically distributed, and autonomously maintained RDF stores offers unprecedented opportunities for predictive modeling and knowledge discovery from such data. However, existing machine learning approaches are limited in their applicability because it is neither desirable nor feasible to gather all of the data in a centralized location for analysis due to access, memory, bandwidth, and computational restrictions, and sometimes privacy and confidentiality constraints. Against this background, we consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, we: (i) introduce statistical query based formulations of several representative algorithms for learning classifiers from RDF data; (ii) introduce a distributed learning framework to learn classifiers from multiple interlinked RDF stores that form a chain; (iii) identify three special cases of RDF data fragmentation and describe effective strategies for learning predictive models in each case; (iv) consider a novel application of a matrix reconstruction technique from the field of Computerized Tomography [1] to approximate the statistics needed by the learning algorithm from projections using count queries, thus dramatically reducing the amount of information transmitted from the remote data sources to the learner; and (v) report results of experiments with a real-world social network data set (Last.fm), which demonstrate the feasibility of the proposed approach.

Keywords: classifier; supervised learning; distributed learning; RDF; SPARQL; linked data

I. INTRODUCTION

The growing adoption of a set of best practices, collectively referred to as Linked Data, for publishing structured data on the Web [2] has made it possible to link and share many disparate, previously isolated, distributed, and autonomously generated and managed data sets across virtually every domain of human endeavor. The community-driven Linked Open Data (LOD) effort allows structured data to be represented using the Resource Description Framework (RDF) [3] in the form of subject-predicate-object triples (also called RDF triples), which describe a directed graph whose labeled edges encode binary relations between labeled nodes. RDF stores and associated query languages such as SPARQL [4] offer the means to store and query large amounts of RDF data. LOD also enables integration of previously isolated distributed data such as data stored in multi-relational databases [5]. At present, LOD includes a few hundred linked data sets that together contain in excess of a few trillion RDF triples [6]. These cover a broad range of domains including government, life sciences, geography, social media, and commerce. The emergence of LOD offers unprecedented opportunities for using disparate data sources in predictive modeling and decision making in such domains.

Figure 1. A motivating scenario of two RDF stores that are linked to form a chain of RDF stores: Facebook users share posts about news items published in New York Times.

We motivate the problem of learning predictive models from multiple interlinked RDF stores using the scenario shown in Fig. 1. In this case, one might want to use data from Facebook and New York Times to predict the interest of a user in belonging to a Facebook group, based on the distribution of tags associated with the New York Times news stories that the user has shared with her social network on Facebook. This is an instance of the node prediction problem [7].
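The scenario in Fig. 1 can be made concrete with a small toy sketch. Everything below is an illustrative assumption rather than data or code from this paper: the two in-memory triple lists, the predicate names (`fb:shared`, `fb:memberOf`, `nyt:tag`), and the helper functions are all hypothetical. The sketch only shows how a user's tag distribution arises by following links from one store into the next. It reads both stores directly, which is exactly the centralized access that the setting considered here rules out.

```python
from collections import Counter

# Hypothetical toy stores: each is a list of (subject, predicate, object) triples.
facebook_store = [
    ("user:alice", "fb:memberOf", "group:politics"),
    ("user:alice", "fb:shared", "nyt:article1"),
    ("user:alice", "fb:shared", "nyt:article2"),
    ("user:bob", "fb:memberOf", "group:sports"),
    ("user:bob", "fb:shared", "nyt:article3"),
]
nyt_store = [
    ("nyt:article1", "nyt:tag", "tag:election"),
    ("nyt:article1", "nyt:tag", "tag:economy"),
    ("nyt:article2", "nyt:tag", "tag:election"),
    ("nyt:article3", "nyt:tag", "tag:baseball"),
]


def objects(store, subject, predicate):
    """Return all objects o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in store if s == subject and p == predicate]


def tag_distribution(user):
    """Count the tags reachable from a user along the chain user -> article -> tag."""
    counts = Counter()
    for article in objects(facebook_store, user, "fb:shared"):
        counts.update(objects(nyt_store, article, "nyt:tag"))
    return counts


def group_label(user):
    """Return the groups a user belongs to (the class label in this toy setup)."""
    return objects(facebook_store, user, "fb:memberOf")
```

In this toy setup, the pair `(tag_distribution(user), group_label(user))` plays the role of one labeled training example: the features live in the second store, the label lives in the first, and the shared article identifiers link the two.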
In general, building such predictive models entails using information from multiple interlinked, physically distributed, autonomously maintained RDF stores. In such a setting, it is neither desirable nor feasible to gather all of the data in a centralized location for analysis, because of access, memory, bandwidth, and computational restrictions. In other settings, access to data may be limited due to privacy and confidentiality constraints [8], [9]. This calls for techniques for learning predictive models (e.g., classifiers) from multiple interlinked RDF stores that support only indirect access to data (e.g., via a query interface such as SPARQL). With the exception of Lin et al. [10], who proposed an approach to learning relational Bayesian classifiers [11] from a single remote RDF store using statistical queries against its SPARQL endpoint, there has been, to the best of our knowledge, very little work on this problem.

Against this background, we consider the problem of learning predictive models from multiple interlinked RDF stores. Specifically, we: (i) introduce statistical query based formulations of several representative algorithms for learn-