Metabolic Information Control System

Andreas Stephanik, Ralf Hofestädt, Matthias Lange, Andreas Freier

Otto-von-Guericke-University of Magdeburg
Department of Computer Science
Institute of Technical and Business Information Systems
Bioinformatics Research Group
Universitätsplatz 2, D-39106 Magdeburg, Germany

ABSTRACT

Systems for the integration of data in molecular biology are becoming increasingly important because neither scientists nor applications can always find all relevant data in a single database. Data integration also makes it possible to derive information of a new quality by using semantic relations between the integrated data of various databases. To supply an application for the simulation of metabolic pathways with the data it needs, we are developing an integration system based on a hybrid approach. The first part of this approach is a data warehouse that provides easy and fast access; its storage system is an object-oriented database system. The second part is homogeneous online access via the internet to various data sources, such as database systems and flat-file-based systems. The components for data access are modular, so they can be created and modified easily using a semi-automatic process. The result is a mediator-based system for the integration of data stored in databases and flat files. Applications can access the integrated data via various interfaces such as CORBA, JDBC or TCP/IP.

Keywords: Integration, Data Retrieval, Databases, Flat Files, Distributed Data Sources, Simulation of Metabolic Pathways

MOTIVATION

Scientists in molecular biology use applications to analyze, compute or simulate complex scenarios. These applications need data from various databases, because in most cases not all relevant data can be found in one data source. One investigation classifies molecular data into 17 categories [1].
Accordingly, about 300 WWW-based data sources are listed. The WWW is developing into the most powerful medium for information retrieval. Molecular biology reflects this development: the majority of its databases are accessible via the internet. For persistent data storage, two general techniques are used: flat files and database systems (DBS) [2]. Public access is mostly provided by a WWW server, which acts as middleware between the user interface and the database.

To take advantage of the potential of these valuable databases, it has to be considered that Bioinformatics is an inherently integrative discipline [3], requiring access to data from a wide range of sources. Without the ability to combine these data in new and interesting ways, the field of Bioinformatics would be severely limited in scope. Consequently, the integration of databases can help to derive new information. In response to these requirements, several systems for the integration of biological data have been developed or are under development.

We have begun to develop a flexible integration system in order to integrate data for several problems and applications. The first application to use the integration system is a metabolic information application. This paper describes both the integration system and the metabolic information application. It begins with a short overview of systems for data integration.

SYSTEMS FOR DATA INTEGRATION

Systems for the automated acquisition of information from heterogeneous molecular biology databases, for the analysis or simulation of biological processes, are primarily based on four technical approaches that are closely related to distributed database management systems (DDBS) [4]. These are: hypertext navigation (e.g. KEGG [5]), data warehouses (e.g. SRS [6], PEDANT [7], HUSAR [8]), multidatabase query languages (e.g. BioKleisli [9], OPM [10]), and agent-based techniques (e.g. Multiagents [11]).
These systems provide access to various databases, so that scientists do not have to search for the desired data in the forest of the internet. Systems for the integration of data should enable homogeneous access to dispersed and heterogeneous data sources. The diversity of the data sources with regard to data formats and interfaces has to be hidden using