Ontology-based information integration using INDUS system

Doina Caragea*, Jie Bao, Jyotishman Pathak and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50010, USA

ABSTRACT
INDUS (Intelligent Data Understanding System) is a federated, query-centric system for information integration and knowledge acquisition from distributed, semantically heterogeneous data sources. INDUS employs ontologies (controlled vocabularies of domain-specific terms and relationships among terms) and inter-ontology mappings to enable a user to view a collection of such data sources (regardless of location, internal structure and query interfaces) as though they were a collection of tables structured according to a user-supplied ontology.

1 INTRODUCTION
The ongoing transformation of biology from a data-poor science into an increasingly data-rich science has resulted in a large number of autonomous data sources (e.g., repositories of protein sequences, structures, expression patterns, interactions). This has led to unprecedented, and as yet largely unrealized, opportunities for large-scale collaborative discovery in a number of areas, including characterization of macromolecular sequence-structure-function relationships and discovery of complex genetic regulatory networks. At present, there are hundreds of databases of interest to molecular biologists alone [Discala et al., 2000]. Because the data repositories are typically autonomous, and often focused on specific subfields of biology, ontological (and hence semantic) differences among them are simply unavoidable. However, in exploring specific scientific questions of interest, scientists often need to be able to retrieve and analyze data from multiple sources. Effective use of such data in a given context requires reconciliation of semantic differences among the relevant data sources from a user’s point of view.
Hence, there is an urgent need for tools to support rapid and flexible assembly and analysis of data from semantically heterogeneous data sources [Jagadish and Olken, 2003].

2 APPROACH
INDUS is a federated, query-centric system for data integration and knowledge acquisition from distributed, semantically heterogeneous data (see Fig. 1). INDUS makes explicit data source-specific information, such as the data source schema and the (typically implicit) data source ontologies. The resulting ontology-extended data sources [Caragea et al., 2004] enable users to specify semantic correspondences between the user ontology and the data source ontologies by specifying inter-ontology mappings.

* To whom correspondence should be addressed.

Fig. 1. INDUS: a system for data integration and knowledge acquisition from semantically heterogeneous distributed data.

This enables each user to view a collection of autonomous, semantically heterogeneous, distributed data sources as though they were a collection of inter-related tables structured according to that user’s ontology. Thus, users can interact with and explore data sources of interest to them from multiple points of view simply by changing their perspective (i.e., the user ontology and the semantic correspondences between the user ontology and the data source ontologies). Queries posed using terms in the user ontology are transformed, using a sound query rewriting algorithm, into queries that can be answered by the individual data sources. The results are expressed in terms of the user’s ontology [Caragea et al., 2004] (see Fig. 1).

3 INDUS PROTOTYPE
We have completed the implementation of a working prototype of the INDUS system to enable biologists with some familiarity with the relevant data sources to rapidly and flexibly assemble data sets from multiple data sources and to query these data sets.
This can be done by specifying a user ontology, simple semantic mappings between the data source-specific ontologies and the user ontology, and queries, all without having to write any code. An initial version of the INDUS software and its documentation is available at www.cild.iastate.edu/GM066387_homepage.htm.
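
To make the idea of mapping-based query rewriting from Section 2 more concrete, the following Python sketch shows one simple way a query posed in a user's vocabulary might be translated into the vocabularies of two data sources and the answers translated back. It is only an illustration under strong assumptions: the source names, terms, data and function names (MAPPINGS, SOURCES, rewrite_query, answer) are all hypothetical and are not taken from INDUS, and the sketch handles only one-to-one renaming of attribute names, not the richer mappings or the query rewriting algorithm of [Caragea et al., 2004].

```python
# Hypothetical illustration; none of these names or data come from INDUS itself.

# Assumed inter-ontology mappings: user-ontology term -> source-ontology term.
MAPPINGS = {
    "source_a": {"protein": "polypeptide", "organism": "species"},
    "source_b": {"protein": "prot", "organism": "taxon"},
}

# Toy data sources, each using its own vocabulary for attribute names.
SOURCES = {
    "source_a": [
        {"polypeptide": "P53", "species": "human"},
        {"polypeptide": "RecA", "species": "E. coli"},
    ],
    "source_b": [
        {"prot": "BRCA1", "taxon": "human"},
    ],
}


def rewrite_query(query, mapping):
    """Translate a {user_term: value} query into one source's vocabulary."""
    return {mapping[term]: value for term, value in query.items()}


def answer(query):
    """Answer a user-ontology query against every source and merge the results."""
    results = []
    for name, rows in SOURCES.items():
        mapping = MAPPINGS[name]
        source_query = rewrite_query(query, mapping)
        # Inverse mapping, used to express results in the user's terms.
        inverse = {src: user for user, src in mapping.items()}
        for row in rows:
            if all(row.get(k) == v for k, v in source_query.items()):
                results.append({inverse[k]: v for k, v in row.items()})
    return results


print(answer({"organism": "human"}))
```

In this sketch, changing the user's perspective amounts to supplying a different MAPPINGS table; the toy data sources themselves are left untouched.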