Abstract: The scientific achievements of molecular biology depend greatly on the capability of computational applications to analyze laboratory results. A comprehensive analysis of an experiment typically requires studying the obtained dataset together with data available in several distinct public databases. However, developing centralized access to these distributed databases raises a set of challenges, such as choosing the best integration strategy, resolving nomenclature clashes, handling data that overlaps between databases, and dealing with huge datasets. In this paper we present GeNS, a system that uses a simple yet innovative approach to address several biological data integration issues. Compared with existing systems, the main advantages of GeNS are its simple maintenance and its coverage and scalability in terms of the number of supported databases and data types. To support our claims we present the current use of GeNS in two concrete applications. GeNS currently contains more than 140 million biological relations, and it can be publicly downloaded or remotely accessed through SOAP web services.

Keywords: Data integration, biological databases

I. INTRODUCTION

The integration of heterogeneous data sources has been a fundamental problem in database research over the last two decades [1-6]. The goal is to achieve better methods to combine data residing at different sources, under different schemas and with different formats, in order to provide the user with a unified view of the data. Although simple in principle, this is a very challenging task due to several constraints, and both the academic and the commercial communities have been working on it and proposing solutions that span a wide range of fields. Life sciences are just one of many fields that benefit from the advances in data integration methods [3, 4, 6].
This is because the information that describes genes, gene products and the biological processes in which they are involved is dispersed over several databases [7]. In addition, with the advances in high-throughput techniques such as gene expression analysis, the experimental results obtained in the laboratory are valuable only after being matched with data stored in public databases [8, 9]. Thus, in order to speed up the investigation process, it is very important to have centralized access to distributed databases.

In this paper, we present GeNS, a powerful but easy-to-use platform that allows the integration of any kind of molecular data. The main advantage of GeNS lies in its schema, which has a general organization that supports the addition of new databases and data types without requiring changes to the schema.

Affiliation: University of Aveiro, DETI/IEETA, 3810-193 Aveiro, Portugal

II. MOTIVATION AND CHALLENGES

According to the latest release of Nucleic Acids Research, there are about 1170 databases in the field of molecular biology [7]. Each database corresponds to the output of a specific study or community and represents a huge investment whose potential has not been fully explored. Being able to integrate data from multiple sources is important for two reasons. First, data about one biological entity may be dispersed over several databases. For a gene, for instance, the nucleotide sequence is stored in GenBank [10], the pathway in KEGG Pathway [11] and the expression data in ArrayExpress [12]. Obtaining a unified view of this data is therefore crucial to understanding the role of the gene. Second, many different databases contain redundant or overlapping information [13], which can be detected by directly comparing the databases. Most of the data stored in these databases is publicly available through custom web interfaces, or as text and XML files [14].
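The two reasons given above, a unified view per gene and the detection of overlapping information, can be illustrated with a minimal Python sketch. It is not the GeNS implementation. The records, the gene identifiers (`GENE_A`, `GENE_B`), the field values and the `MirrorDB` source are hypothetical placeholders, and the sketch assumes that all sources already share one gene identifier, which in practice is exactly where nomenclature clashes arise.

```python
from collections import defaultdict

# Hypothetical, hand-written records standing in for parsed database exports.
# Identifiers and values are placeholders, not real database content.
# "MirrorDB" is an invented source, added only to produce an overlap.
SOURCES = {
    "GenBank": [{"gene": "GENE_A", "sequence": "ATGGCC"}],
    "KEGG Pathway": [{"gene": "GENE_A", "pathway": "pathway_1"}],
    "ArrayExpress": [
        {"gene": "GENE_A", "expression": 2.5},
        {"gene": "GENE_B", "expression": 0.7},
    ],
    "MirrorDB": [{"gene": "GENE_A", "sequence": "ATGGCC"}],
}


def unified_view(sources):
    """Group every attribute by gene, remembering which source supplied it."""
    view = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            gene = record["gene"]
            for field, value in record.items():
                if field != "gene":
                    view[gene].setdefault(field, []).append((source_name, value))
    return dict(view)


def overlapping_fields(view):
    """Return the (gene, field) pairs that more than one source reports."""
    return sorted(
        (gene, field)
        for gene, fields in view.items()
        for field, values in fields.items()
        if len(values) > 1
    )


view = unified_view(SOURCES)
overlaps = overlapping_fields(view)
```

Here `view["GENE_A"]` gathers the sequence, pathway and expression attributes in one place, while `overlaps` lists the attributes that were supplied by more than one source.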
To get this data one has to access each database independently, download and parse the files, and finally merge all the results into a unified and consistent dataset.

In recent years, several efforts have been made to simplify the process of integrating data from multiple sources. From these we have selected three that seem the most representative. The first, BioWarehouse [15], contains data from multiple sources, including metabolic pathways and enzymes. BioWarehouse uses a database schema oriented to predefined data types, meaning that the addition of new data types implies adding new tables and methods to query them. This database was designed to be more oriented to prokaryotes than to eukaryotes. A different vision has been applied in BioCoRE [16], which uses a more flexible approach to integrate data. According to the authors, the system allows the storage of almost all biochemical processes. One drawback is the high complexity of the proposed model, which contains more than 200 classes. A third approach has been applied by Biozon, which uses a simple and abstract schema that supports data based on a hierarchical metamodel [17]. Since the schema is general, in Biozon each relation from the metamodel is explicitly stored in the database. As a consequence, the current instance contains about 6.5 billion relations, which decreases performance. Biozon is publicly available through an intuitive

GeNS: a Biological Data Integration Platform
Joel Arrais, João E. Pereira, João Fernandes and José Luís Oliveira
World Academy of Science, Engineering and Technology 58 2009