QUERY DRIVEN SIMULATION AS A TOOL FOR GENETIC ENGINEERS

John A. Miller†, Jonathan Arnold‡, Krys J. Kochut†, A. Jamie Cuticchia‡ and Walter D. Potter†

† Department of Computer Science and Artificial Intelligence Programs
‡ Department of Genetics
University of Georgia

ABSTRACT

Simulations and animations of genetic structures and functions, simulations of actual or conceived experiments, and animations of algorithms such as simulated annealing, which is used to reconstruct a chromosome from its clonable DNA fragments, will be useful to genetics researchers and students alike. In this paper, we discuss the design of an integrated simulation/object-oriented database system that genetic engineers can use to better understand and visualize the objects of their study as well as their experimental procedures. Such a system can provide a solid foundation for Computer-Aided Genetic Engineering (CAGE).

1. INTRODUCTION

Storing genome mapping information on organisms is currently the major unsolved problem of the Human Genome Initiative. Relational databases are beginning to be used to store the vast amount of genetic information that is being collected [Cuti91b, Rudd90, Rudd91]. We are currently designing an object-oriented database to store such information. Changing to this paradigm produces several significant advantages: storage of complex objects and images, such as gel patterns; more importantly, the integration (actually encapsulation) of methods with the data; and lastly, a ready-made framework into which new modeling and methodological modules can be plugged for exploration via simulation. In other words, the database stores the data and its behavior together as objects. Now, in addition to knowing about a certain gene, the database can also know about its behavior (e.g., how it codes for a protein), the methods necessary to place it on a physical or genetic map [Pear88, Land87, Suls88], or the search methods to link it to existing databases, such as GenBank.
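The idea of storing data and behavior together can be illustrated with a minimal Python sketch. It is not the authors' database design: the class name `Gene`, its fields, and the deliberately partial `CODON_TABLE` are hypothetical and chosen only to show a gene object that carries its own "codes for a protein" method.

```python
from dataclasses import dataclass

# Partial codon table for illustration only (entries follow the standard
# genetic code); a real system would store all 64 codons.
CODON_TABLE = {
    "ATG": "M", "GCC": "A", "AAA": "K", "TGG": "W",
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


@dataclass
class Gene:
    """Hypothetical stored object: data (name, sequence) plus behavior."""
    name: str
    sequence: str  # coding strand, read 5' to 3'

    def codons(self):
        """Split the sequence into complete triplets, dropping any remainder."""
        usable = len(self.sequence) - len(self.sequence) % 3
        return [self.sequence[i:i + 3] for i in range(0, usable, 3)]

    def translate(self):
        """Translate codon by codon until a stop codon; unknown codons become 'X'."""
        protein = []
        for codon in self.codons():
            amino_acid = CODON_TABLE.get(codon, "X")
            if amino_acid == "*":
                break
            protein.append(amino_acid)
        return "".join(protein)


toy = Gene(name="toy", sequence="ATGGCCAAATGGTAA")
print(toy.translate())
```

A query that retrieves such an object can invoke `translate()` directly, which is the sense in which the behavior travels with the data.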
The ability to store arbitrarily complex methods in the object-oriented database allows simulation/animation features to be tightly integrated with the database. This is the idea behind query driven simulation [Mill89, Mill90, Pott90, Mill91a, Mill91b, Mill91c, Mill91d, Koch91]. By simply submitting a query to a query driven simulation system, the user causes information about objects (which may be simulation generated) to be displayed. For example, the scientist can watch as a gene or DNA fragment is being integrated into a physical map.

2. BACKGROUND

We have developed a Contig Mapping and Analysis Package, CMAP [Cuti91b], which provides a foundation for computer-aided reverse genetics by organizing information about DNA fragments derived from an organism's genome into a physical map. The user can store a variety of information about a particular segment of DNA in this relational database. This information, a collection of attributes on clones, can be either descriptive, such as any genes contained in a particular DNA fragment, or experimental, such as hybridization profiles for comparison with other fragments. In addition, this relational database is coupled with one method for ordering the DNA fragments along the chromosome to form a physical map.

The user interface is designed both to minimize the learning curve associated with database usage and, through error-checking protocols, to eliminate the possibility of entering data outside the ranges of acceptable attribute values (the database constraints). Queries are currently expressed in SQL (Structured Query Language), which is the de facto standard query language for relational databases (and is the basis for some object-oriented query languages). SQL gives the user the ability to formulate queries based on any combination of attributes contained within the database. To eliminate the need for novice users to learn SQL, an interface was designed that allows users to build queries by menu choices.
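A menu-driven query builder of the kind described above might look roughly like the following Python sketch. The CMAP schema is not given in this paper, so the table `clone`, its columns, the sample rows, and the helper `build_query` are all hypothetical; the sketch only shows how menu selections could be turned into a parameterized SQL query over any combination of attributes.

```python
import sqlite3

# Hypothetical attribute names a menu might offer; not the actual CMAP schema.
ALLOWED_COLUMNS = ("chromosome", "gene")


def build_query(choices):
    """Turn {attribute: value} menu choices into a parameterized SELECT."""
    unknown = sorted(set(choices) - set(ALLOWED_COLUMNS))
    if unknown:
        # Rejecting unknown attributes mirrors the error checking in the text.
        raise ValueError(f"unknown attribute(s): {unknown}")
    columns = sorted(choices)
    sql = "SELECT clone_id FROM clone"
    if columns:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in columns)
    sql += " ORDER BY clone_id"
    return sql, [choices[col] for col in columns]


# Toy in-memory database with made-up clones.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clone (clone_id TEXT PRIMARY KEY, chromosome TEXT, gene TEXT)"
)
conn.executemany(
    "INSERT INTO clone VALUES (?, ?, ?)",
    [("c001", "IV", "geneA"), ("c002", "IV", "geneB"), ("c003", "VII", "geneA")],
)

sql, params = build_query({"chromosome": "IV", "gene": "geneA"})
matches = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(matches)
```

Column names are interpolated into the SQL text only after they pass the whitelist check, while values are always bound as parameters, so the user never has to write or see SQL.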
The CMAP database is a first step toward the production of an object-oriented database that will allow researchers to store and retrieve information vital for the implementation of reverse genetics through the activation of queries and more complicated methods. Physical mapping involves the generation of an ordered list of clones, a list which represents the linear arrangement of clones along a chromosome [Coul86, Olso86, Suls88, Link91], much as genetic loci are ordered with respect to their position on linkage maps of chromosomes.

Reverse genetics, the process of going from a DNA fragment to its function, relies on linking the data on a DNA fragment, such as its sequence, to other databases, such as GenBank and the genetic map. Making these linkages requires the activation of specific methods, such as FASTA [Pear88] for searches of GenBank to find related DNA sequences or proteins, or MAPSEARCH [Rudd91] to place the restriction map of a clone in the context of the organism's restriction map [Koha87]. In this way a physical map and its associated databases become a powerful tool for systematically carrying out the approach of reverse genetics.

The CMAP software package is designed for physical mapping experiments. Clonal attributes, such as chromosome localization and associated known genes, can be stored, as well as raw data useful in physical mapping. By using hybridization profiles of clones with oligonucleotide probes as a measure of overlap [Lehr90, Crai90], CMAP can also produce data files which can be used by programs to order the clones accurately [Mich87]. The degree to which two clones overlap determines the degree of similarity in their hybridization profiles. We define $d_{ab}$ as the difference in hybridization profiles between clones $a$ and $b$. Thus, if clones $a$ and $b$ differ in the hybridization results for 12 probes, then $d_{ab} = 12$. We define $D$ as the sum of the linking $d_{ab}$'s across the entire order of clones.
The mapping procedure involves the minimization of this value,

$$D = \sum_{i=1}^{n} d_{a(i)\,b(i)}, \qquad \text{where } a(i) = b(i-1) \text{ for } i > 1.$$

This is essentially a Traveling Salesman Problem (TSP) in which the problem size is quite large (the number of clones, $n$, is in the hundreds). Since TSP is an NP-hard [Gare79] optimization problem, simulated annealing is used to find approximate solutions. (Simulated annealing is not guaranteed to give optimal solutions, but in practice it tends to give highly satisfactory ones.) Once an order of fragments has been reached, CMAP can update the linkage of clones according to this order to create the linear arrangement within the database.
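The ordering step can be sketched in Python as follows. This is a generic simulated annealing loop written under stated assumptions, not the CMAP implementation: hybridization profiles are assumed to be 0/1 lists (one entry per probe), the difference between two clones is taken to be the number of probes on which they disagree, the move is a segment reversal, and the geometric cooling schedule and all parameter values are illustrative choices.

```python
import math
import random


def profile_difference(a, b):
    """d_ab: number of probes whose hybridization result differs."""
    return sum(x != y for x, y in zip(a, b))


def total_difference(order, profiles):
    """D: sum of d_ab over adjacent clones in the proposed order."""
    return sum(
        profile_difference(profiles[order[i]], profiles[order[i + 1]])
        for i in range(len(order) - 1)
    )


def anneal_order(profiles, steps=20000, t_start=5.0, t_end=0.01, seed=0):
    """Search for a clone order with small D; returns (best_order, best_D)."""
    rng = random.Random(seed)
    order = list(range(len(profiles)))
    rng.shuffle(order)
    cost = total_difference(order, profiles)
    best_order, best_cost = order[:], cost
    for step in range(steps):
        # Geometric cooling from t_start toward t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / steps)
        # Propose reversing a randomly chosen segment of the current order.
        i, j = sorted(rng.sample(range(len(order)), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        candidate_cost = total_difference(candidate, profiles)
        delta = candidate_cost - cost
        # Always accept improvements; accept worse orders with probability
        # exp(-delta / t), which shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            order, cost = candidate, candidate_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
    return best_order, best_cost


# Toy data: 8 clones, each positive for 4 consecutive probes out of 12.
profiles = [[1 if k <= p < k + 4 else 0 for p in range(12)] for k in range(8)]
order, d_total = anneal_order(profiles)
print(order, d_total)
```

Recomputing D from scratch for every candidate keeps the sketch short; with hundreds of clones one would instead update D incrementally from the two links changed by each reversal.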