Building Integrated Systems for Data Representation and Analysis in Molecular Biology

G. Perrière, F. Chevenet, F. Dorkeld, T. Vermat and C. Gautier
Laboratoire de Biométrie, Génétique et Biologie des Populations
Université Claude Bernard Lyon 1
43, bd. du 11 Novembre 1918
69622 Villeurbanne Cedex (France)

Abstract

Biochemical techniques have given biology experimental tools for sequencing genome fragments. Until the development of extremely fast sequencing methods, these fragments were short, often corresponding to a single gene. Large sequencing projects have now been started, leading to the availability of huge continuous fragments (300 kb for yeast chromosome III). Besides the problem of storing and representing such an amount of data, the internal organization of these large fragments is often completely unknown. Consequently, in the near future, the data and knowledge bases dedicated to molecular biology will almost certainly be associated with sequence analysis systems, in order to help study the unannotated fragments. We present here an example of such an association between two "biological" object-oriented knowledge bases, ColiGene and MultiMap, and two sets of methods that are able to communicate with them. Then, we show how it is possible to formalize the methods under an object-oriented knowledge representation model for tasks.

Introduction

Since 1981, our group has developed many tools for nucleotide sequence analysis. One of our first developments was the ACNUC database [1], which structures the GenBank [2], EMBL [3] and NBRF-PIR [4] collections under an entity-relationship model. The retrieval system Query, associated with ACNUC, allows elaborate queries to be made on these collections. We have also developed a software package that is able to interact with the ACNUC database: the Analseq system [5]. This software provides a set of statistical tools useful for sequence analysis.

More recently, we have built four complementary tools for representing biological and methodological knowledge for the study of prokaryotic and eukaryotic organisms. This was done using an object-centered approach instead of the classical programming approach used in Query and Analseq. ColiGene is the first object-oriented knowledge base in molecular biology that we built using the Shirka representation system [6]. This system is devoted to the representation of the mechanisms involved in the regulation of gene expression in the bacterium Escherichia coli [7]. MultiMap is a knowledge base dedicated to the analysis of eukaryotic genomes, centered on the relations between genomic sequences and their chromosomal localization [8].

In their early stages of development, ColiGene and MultiMap each integrated a few sequence analysis methods. Little by little, however, we found it useful to give each knowledge base access to the methods associated with the other. Moreover, it was obvious that the methods integrated in ColiGene and MultiMap lacked standardization and that their integration in the knowledge bases was somewhat artificial (i.e. they were associated pieces of software rather than truly integrated tools). This is why we decided to separate the methods from the biological objects, while maintaining the possibility of communication between these two kinds of knowledge.

To this end, we have developed two complementary packages: Digit and Misa. The first was primarily designed as a graphical layer for Slot [9], a knowledge base devoted to the methodology of multivariate analysis. The tools available in Digit are generic, so they can be used with any knowledge base developed with the Shirka system. After Digit, we developed Misa, a more specific system which, thanks to its modular design, is able to integrate virtually any available sequence analysis method. The modules that make up the Misa system are now our basis for building a more advanced system in which methods, defined as "tasks", are managed by an "intelligent" system that guides the user in choosing and chaining the methods needed to perform complex analyses.

General description of the software

All our software was developed on SUN SPARCstations using the Le_Lisp language [10] and its extension, Aida [11]. Efficient operation of ColiGene and MultiMap requires computers with at least 16 Mbytes of main memory.

ColiGene

When we built this knowledge base, we chose to represent some aspects of Escherichia coli genetics, first because this organism was one of the best known in biology, and second because many studies had previously been conducted on it in our group [12-14]. With ColiGene, our aim was not to model the whole of the knowledge on Escherichia coli genetics but only the parts involved in the relationships