Learning Link-Based Naïve Bayes Classifiers from Ontology-Extended Distributed Data

Cornelia Caragea¹, Doina Caragea², and Vasant Honavar¹

¹ Computer Science Department, Iowa State University
² Computer and Information Sciences, Kansas State University
{cornelia,honavar}@cs.iastate.edu, dcaragea@ksu.edu

Short Paper

Abstract. We address the problem of learning predictive models from multiple large, distributed, autonomous, and hence almost invariably semantically disparate, relational data sources from a user’s point of view. We show under fairly general assumptions, how to exploit data sources annotated with relevant meta data in building predictive models (e.g., classifiers) from a collection of distributed relational data sources, without the need for a centralized data warehouse, while offering strong guarantees of exactness of the learned classifiers relative to their centralized relational learning counterparts. We demonstrate an application of the proposed approach in the case of learning link-based Naïve Bayes classifiers and present results of experiments on a text classification task that demonstrate the feasibility of the proposed approach.

1 Introduction

Recent advances in sensors, digital storage, computing, and communications technologies have led to a proliferation of autonomously operated, distributed data repositories in virtually every area of human endeavor. Many groups have developed approaches for querying semantically disparate sources [1–4], for discovering semantic correspondences between ontologies [5, 6], and for learning from autonomous, semantically heterogeneous data [7]. One approach to learning from semantically disparate data sources is to first integrate the data from various sources into a warehouse based on semantics-preserving mappings between the data sources and a global integrated view, and then execute a standard learning algorithm on the resulting centralized, semantically homogeneous data.
Given the autonomous nature of the data sources on the Web, and the diverse purposes for which the data are gathered, it is unlikely that there exists a unique global view of the data that serves the needs of different users or communities of users under all scenarios. Moreover, in many application scenarios, it may be impossible to gather the data from different sources into a centralized warehouse because of restrictions on direct access to the data. This calls for approaches to learning from semantically disparate data that do not rely on direct access to the data but instead can work with results of statistical queries against