Semantic Integration of Heterogeneous Databases on the Web

Niladri Chatterjee
Department of Mathematics
Indian Institute of Technology Delhi
New Delhi – 110 016, India
Email: niladri@maths.iitd.ernet.in

Madhav Krishna
Division of Computer Engineering
Netaji Subhas Institute of Technology
New Delhi – 110 075, India
Email: madhkrish@gmail.com

Abstract

The Web is replete with databases, many of which are modeled on the relational paradigm. Currently, the federated database technique is used extensively for simultaneously querying data from multiple databases. However, the effectiveness of this technique is questionable when it comes to querying heterogeneous databases. It is therefore imperative to develop an efficient methodology for the semantic integration of heterogeneous online databases. This may be realized by defining a mapping from a relational database to a description that utilises the Resource Description Framework (RDF). Such a representation would be machine processable, would make the semantics expressed by databases more explicit and would thereby facilitate their integration.

1. Introduction

A mapping from a relational database (or ER diagram) to an RDF representation has already been proposed in [1]. In the present work we extend the idea to include all features of ER diagrams (such as aggregation, specialization and multi-valued attributes) as well as other important aspects of database design, such as enforcing integrity constraints. This is done while keeping in mind the need for an efficient methodology for the semantic integration of heterogeneous databases [2] on the Web.

Integration of heterogeneous databases requires the determination of semantic matches among the schemas of the participating databases, a process often referred to as “schema matching”.
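The idea of describing relational data in RDF, introduced above, can be made concrete with a small sketch. The snippet below is only an illustration under stated assumptions: it is neither the mapping proposed in [1] nor the one developed in this paper. The base URI `http://example.org/db/`, the function `row_to_triples`, the URI-building convention and the sample `Employee` row are all hypothetical. Only the `rdf:type` URI is taken from the RDF vocabulary itself.

```python
# Hypothetical base URI for illustration; not taken from the paper.
BASE = "http://example.org/db/"
# Standard RDF property linking a resource to its class.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table, key_column, row):
    """Map one relational row (a dict) to (subject, predicate, object) triples.

    Assumed convention: the subject URI is built from the table name and
    the primary-key value; each remaining non-NULL column becomes one triple.
    """
    subject = f"{BASE}{table}/{row[key_column]}"
    # State that the row is an instance of the class named after the table.
    triples = [(subject, RDF_TYPE, f"{BASE}{table}")]
    for column, value in row.items():
        if column == key_column or value is None:
            # The key is already encoded in the subject URI,
            # and NULL values produce no triple.
            continue
        triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


# Hypothetical sample row.
employee = {"id": 7, "name": "A. Rao", "dept": None}
for triple in row_to_triples("Employee", "id", employee):
    print(triple)
```

Because every column is named by a URI rather than by a bare string, two databases that use the same property URI state explicitly that they mean the same thing. This is the kind of explicit semantics that the integration approach relies on.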
Schema matching is indeed a difficult and time-consuming task, and it has been addressed primarily by employing techniques such as:

a) Rule-based matching: A number of manually created rules are applied to determine semantic matches among databases. These rules utilize the information provided by a database schema: names of the elements of the database, integrity constraints, data types etc. An example of a rule-based system is the ‘TranScm’ system [3], which employs rules such as “two elements match if they have the same name” (synonyms). A major disadvantage of rule-based systems is that crafting rules is a manual process. It is also extremely difficult to devise rules that take advantage of the data instances contained in the database.

b) Learning-based matching: Various learning-based models are employed so that the matching process may be aided by previous matches. Many probabilistic models can also be used in this technique, and these may effectively utilize the information contained in data instances. For example, the SemInt system [4] uses an artificial neural network that matches schema elements based on element specifications (data types, the existence of constraints) and statistics of data instances (maximum, minimum, average and variance).

It seems clear, therefore, that determining semantic matches between databases would be much simpler if the elements of these databases expressed the desired semantics in an explicit, accurate and objective fashion. This can be achieved at the database design stage or during the mapping of a data model to a database.

This paper is organized as follows. Section 2 summarises the paradigms of database heterogeneity. Section 3 describes the proposed methodology for mapping key features of an ER diagram representation to an RDF format, with extensive use of user-defined URIref vocabularies. Section 4 discusses the database integration methodology.
Section 5 presents a brief account of related work, and finally, Section 6 presents our concluding remarks.

2. Heterogeneity in Databases

Heterogeneity in databases can be broadly classified into three categories [5] as schematic, semantic and