Learning Classifiers from Remote RDF Data Stores Augmented with RDFS Subclass Hierarchies

Harris T. Lin*, Ngot Bui, Vasant Honavar

* Department of Computer Science, Iowa State University, Ames, IA 50011, USA. htlin@iastate.edu
Artificial Intelligence Research Laboratory, Center for Big Data Analytics and Discovery Informatics, College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA. {npb123, vhonavar}@ist.psu.edu

Abstract—Rapid growth of RDF data in the Linked Open Data (LOD) cloud offers unprecedented opportunities for analyzing such data using machine learning algorithms. The massive size and distributed nature of the LOD cloud present a challenging machine learning problem in which the data can only be accessed remotely, i.e., through a query interface such as the SPARQL endpoint of the data store. Existing approaches to learning classifiers from RDF data in such a setting fail to take advantage of the RDF schema (RDFS) associated with the data store, which asserts subclass hierarchies that the learner can potentially exploit. Against this background, we present a general approach that augments an existing directed graphical model with hidden variables that encode subclass hierarchies via probabilistic constraints. We also present an algorithm, ProbAVT, that adopts the variational Bayesian expectation maximization approach to efficiently learn parameters in such settings. Our experiments with several synthetic and real-world datasets show that: (i) ProbAVT matches or outperforms its counterpart that does not incorporate background knowledge in the form of subclass hierarchies; (ii) ProbAVT remains competitive with other state-of-the-art models that incorporate subclass hierarchies, and scales to large hierarchies consisting of tens of thousands of nodes.

I. INTRODUCTION

Resource Description Framework (RDF) offers a formal language for describing structured information on the Web, and has emerged as a basic representation format for the Semantic Web over the past decade [1]. Cyganiak [2] estimated in 2011 that about 300 interlinked data sets containing over 31 billion triples had been published in the Linked Open Data cloud, covering domains including government, life sciences, geography, social media, and commerce. The increasing availability of large RDF data sets on the Web offers unprecedented opportunities for extracting useful knowledge or predictive models from RDF data, and for using the resulting models to guide decisions in a broad range of application domains. Indeed, recent efforts have considered the use of machine learning approaches, and in particular statistical relational learning algorithms [3], to extract knowledge from RDF data [4], [5], [6], [7], [8].

However, most existing approaches to learning predictive models from RDF data assume that the learning algorithm has direct access to the RDF data. In many settings, it may not be feasible to transfer a massive RDF data set from a remote location for local processing by the learning algorithm. Even in settings where it is feasible to give the learning algorithm direct access to a local copy of an RDF data set, algorithms that assume in-memory access to data cannot cope with RDF data sets that are too large to fit in memory. Lin et al. [6] presented an approach for constructing Relational Bayesian Classifiers (RBCs) [9] from RDF data using statistical queries through the SPARQL endpoint of the RDF store. More recently, Lin et al. [5] proposed extensions of this approach for learning a class of generative models from a network of interlinked RDF data stores.
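To make the notion of a statistical query concrete, the following is a minimal sketch of how class-conditional counts might be requested from a SPARQL endpoint and turned into probability estimates. It is an illustration only, not the construction used in [5] or [6]: the function names, the `example.org` IRIs in the usage note, and the Laplace smoothing constant `alpha` are all assumptions introduced here, and the sketch only builds the query text rather than contacting an endpoint.

```python
def count_query(class_predicate, class_value, attr_predicate):
    """Build a SPARQL aggregate query (hypothetical helper).

    The query asks the remote store for the number of resources that
    carry a given class label, grouped by the value of one attribute.
    Only these counts, not the raw triples, would cross the network.
    """
    return (
        "SELECT ?value (COUNT(?s) AS ?n)\n"
        "WHERE {\n"
        f"  ?s <{class_predicate}> <{class_value}> .\n"
        f"  ?s <{attr_predicate}> ?value .\n"
        "}\n"
        "GROUP BY ?value"
    )


def conditional_probabilities(counts, alpha=1.0):
    """Turn per-value counts into smoothed estimates of P(value | class).

    `counts` maps each attribute value to the count returned by the
    endpoint; `alpha` is an assumed Laplace smoothing constant.
    """
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (n + alpha) / total for value, n in counts.items()}
```

For example, `count_query("http://example.org/label", "http://example.org/Positive", "http://example.org/genre")` yields a query whose result rows could be collected into a dictionary and passed to `conditional_probabilities`. The learner then needs only aggregate counts from the store, which is what makes remote access feasible when the data set is too large to download.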
However, the RDF triples in an RDF store are often accompanied by an RDF Schema (RDFS) [10] that specifies a set of classes. These classes organize RDF objects (subjects and objects of predicates) and predicates into type hierarchies; the schema also specifies domain and range restrictions on RDF predicates (i.e., the types of RDF objects that can appear as subjects or objects of a predicate, respectively). RDF schemas offer a means to view RDF data at different levels of abstraction. For example, an individual can be described as a student at one level of abstraction; as an undergraduate or a graduate at a finer level of abstraction; or (in the case of an undergraduate) as a freshman, sophomore, junior, or senior. RDF schemas thus make it possible to learn classifiers that are expressed in terms of abstract attribute values, leading to simpler, accurate, and easier-to-comprehend models that use familiar, hierarchically related attributes. Abstraction provides a form of regularization to minimize overfitting (the finer the level of granularity of description used, the smaller the