Journal of Digital Information Management, Volume 8, Number 4, August 2010

ABSTRACT: Schema matching is a basic problem in many database application domains, such as data integration. The problem of schema matching can be formulated as follows: “given two schemas, S_i and S_j, find the most plausible correspondences between the elements of S_i and S_j, exploiting all available information, such as the schemas, instance data, and auxiliary sources” [24]. Given the rapidly increasing number of data sources to integrate, and because of database heterogeneities, manually identifying schema matches is a tedious, time-consuming, error-prone, and therefore expensive process. As systems become able to handle more complex databases and applications, their schemas become larger, further increasing the number of matches to be performed. Thus, automating this process to make it faster and less labor-intensive has been one of the main tasks in data integration. However, the correspondences between schemas cannot be determined fully automatically, primarily because the semantics of the schemas differ and are often neither made explicit nor documented. Several solutions to the schema matching problem have been proposed. Nevertheless, these solutions are still limited, as they do not exploit most of the available information related to the schemas, which affects the result of the integration. This paper presents an approach for matching schemas of heterogeneous relational databases that utilizes most of the information related to the schemas, thereby indirectly exploring their implicit semantics and further improving the results of the integration.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems – relational databases, transaction processing

General Terms: Algorithms, Management, Measurement, Performance

Keywords: Database integration, Schema matching, Heterogeneous, Biomedical database

Received: 30 July 2010; Revised: 11 December 2009; Accepted: 30 December 2009

1. Introduction

A database schema comprises the gross structure of the database and the constraints on it. Database schemas often do not provide explicit semantics for their data. Database heterogeneities, or differences, can make access to information intricate [22]. Thus, a heterogeneous database that unites various existing databases, which support different schemas and technologies, by providing a uniform database schema and querying capabilities is critically required [22]. The process of integrating data from multiple, heterogeneous sources is called heterogeneous database integration [22]. This process is made harder by heterogeneities at the following levels: (i) syntactic heterogeneity: differences in the language used for representing the elements; (ii) structural heterogeneity: differences in the types and structures of the elements; (iii) model/representational heterogeneity: differences in the underlying models; and (iv) semantic heterogeneity: the same real-world entity is represented using different terms, or vice versa.

Schemas support declarative access to and manipulation of data. Thus, they represent the prime interface for establishing interoperability between tools that depend on shared data. Heterogeneous database integration, or simply database integration, aims at providing a uniform and consistent view, the so-called global schema, over a set of autonomous and heterogeneous data sources, so that data residing in different sources can be accessed as if it were in a single schema.
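To make the idea of a global schema concrete, the following minimal Python sketch shows two sources that describe the same real-world entity with different attribute names (semantic heterogeneity) being read through one uniform view. It is an illustration only, not the approach proposed in this paper: the source names, attribute names, sample rows, and the functions `to_global` and `global_view` are all hypothetical, and the correspondences are assumed to be already known.

```python
# Hypothetical rows from two autonomous sources. Both describe patients,
# but use different attribute names for the same concepts.
SOURCE_A_ROWS = [{"pat_id": 1, "surname": "Lee"}]
SOURCE_B_ROWS = [{"PatientNo": 7, "LastName": "Omar"}]

# Assumed correspondences: source attribute -> global schema attribute.
# In practice these would be the output of a schema matching step.
MAPPINGS = {
    "A": {"pat_id": "patient_id", "surname": "last_name"},
    "B": {"PatientNo": "patient_id", "LastName": "last_name"},
}


def to_global(source, row):
    """Rename the attributes of one source row to the global schema names."""
    mapping = MAPPINGS[source]
    return {mapping[attr]: value for attr, value in row.items() if attr in mapping}


def global_view():
    """Return all rows from both sources under the single global schema."""
    rows = [to_global("A", r) for r in SOURCE_A_ROWS]
    rows += [to_global("B", r) for r in SOURCE_B_ROWS]
    return rows


if __name__ == "__main__":
    for row in global_view():
        print(row)
```

A caller of `global_view` sees only `patient_id` and `last_name`, regardless of which source a row came from. The hard part, which this sketch assumes away, is discovering the entries of `MAPPINGS`.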
In practice, data integration is often done incrementally, starting with a simple global schema and adding new data sources when needed. The integration of a new data source into an existing global schema can be performed in two steps: a matching step and a data transformation step. In the first step, the source schemas are compared against each other to discover their similar and distinct elements. While the distinct elements and their instances can be taken over from the data source, the correspondences between the similar elements are needed in the second step to generate queries for transforming their instances from the source schema into the global schema [22].

Schema matching is a basic problem in many database application domains, such as data integration. It is a fundamental operation in the manipulation of schema information: match takes two schemas as input and produces a mapping between the elements of the two schemas that correspond semantically to each other. Schema matching is typically performed manually, perhaps supported by a graphical user interface. Manually specifying schema matches is a tedious, time-consuming, error-prone, and therefore expensive process. This is a growing problem given the rapidly increasing number of data sources to integrate. As systems become able to handle more complex databases and applications, their schemas become larger, further increasing the number of matches to be performed. The level of effort is at least linear in the number of matches to be performed, maybe worse than linear if one

An Approach for Matching Schemas of Heterogeneous Relational Databases
Yaser Karasneh 1, Hamidah Ibrahim 2, Mohamed Othman, Razali Yaakob
1, 2 Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 Serdang, Selangor D. E., Malaysia
1 karasneh@gmail.com, 2 hamidah@fsktm.upm.edu.my