Usefulness of Solution Algorithms of the Traveling Salesman Problem in the Typing of Biological Sequences in a Clinical Laboratory Setting

Javier Garcés Eisele¹, Carolina Yolanda Castañeda Roldán², Mauricio Osorio Galindo², Ma. del Pilar Gómez Gil²

¹ Universidad de las Américas, Puebla, Depto. de Química y Biología. CIQB. jgarces@mail.udlap.mx
² Universidad de las Américas, Depto. de Ing. en Sistemas Computacionales. CENTIA. ccastane@mail.udlap.mx, josorio@mail.udlap.mx, pgomez@mail.udlap.mx

Abstract

Our concern is to solve the problem of typing deoxyribonucleic acid (DNA) sequences in a laboratory setting. Here we seek solution algorithms for the classification of restriction patterns, which forms part of this problem, in order to evaluate the amount of information generated by a given restriction enzyme. A distance matrix is generated by pairwise comparison of the restriction patterns and used to classify the patterns according to their similarity. This problem can be mapped to the Traveling Salesman Problem (TSP). Several known and new solution algorithms have been tested. Interestingly, a very simple, modified nearest neighbor analysis performed best for this kind of problem. However, when the distance matrix is replaced by a “distinction matrix”, which uses a threshold function to express directly the similarity (0) or dissimilarity (1) between restriction patterns, the results of at least one local search algorithm are dramatically improved.

1. Introduction

For the TSP, we are given a complete, weighted graph and we want to find a tour (a cycle through all the vertices) of minimum weight [1]. One formal definition of the TSP can be found in [2]. Interestingly, several problems arising from the analysis of DNA sequences can be formulated analogously to the TSP, one of which will be presented and analyzed herein.

DNA is deoxyribonucleic acid, i.e.
the genetic material that encodes the characteristics of living things. DNA consists of strings of molecules called nucleotides. There are four nucleotides in DNA, distinguished by their bases, each denoted by the first letter of the base: adenine (A), cytosine (C), guanine (G) and thymine (T) [3]. A DNA sequence can, therefore, be treated as a character string over an alphabet of 4 letters. The sequence of these letters defines the characteristics of any living being, so knowledge of the sequence, or at least part of it, allows the identification of the organism to which the sequence belongs. Different types of sequence analysis can therefore be employed in a clinical laboratory setting in order to identify an infectious agent present in a sample taken from a given patient.

The instance that will be treated is an example of the so-called sequence-typing problem (STP) applied to the case of the Human Papilloma Viruses (HPV), which are associated with the development of cervical cancer [4]. The required sequence analysis may be performed by a technique called RFLP-PCR (Restriction Fragment Length Polymorphism coupled to Polymerase Chain Reaction). Briefly, a segment of the viral genome is analyzed with the help of so-called restriction enzymes, which cut the segment wherever a particular short substring is located; e.g. the enzyme EcoRI recognizes the substring GAATTC [5]. The pattern (sizes) of the generated fragments is then determined, as it is a function of the sequence itself. The HPV types may then be identified, as long as the patterns generated by an enzyme are different for each virus. Otherwise, combinations of enzymes have to be used. Until now 48 reference sequences have been published and more than 180 restriction enzymes are available to perform the typing, each recognizing a different subsequence or substring. In order to select an optimal combination of enzymes to carry out the typing, it is important to evaluate each enzyme, i.e.
how much information the enzyme yields on average. In a simple approach, this requires grouping the restriction patterns according to their similarity, which means that we have to determine the distance between each pair of patterns and order the patterns linearly according to their similarity. This in turn yields a distance matrix from which we have to select a Hamiltonian path or circuit of minimal weight. Thus, we are confronted with a problem similar to the TSP. The instances are symmetric but not always geometric. However, due to the evolutionary

Proceedings of the 14th International Conference on Electronics, Communications and Computers (CONIELECOMP’04) 0-7695-2074-X/04 $20.00 © 2004 IEEE
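
The pipeline outlined in the introduction (digest each sequence, compare the resulting fragment patterns, order the patterns along a low-weight Hamiltonian path) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The distance measure (the number of fragment sizes not shared by two patterns), the plain nearest neighbor heuristic (the modified variant mentioned in the abstract is not specified in this section), the threshold of the distinction matrix, and all function names are hypothetical. EcoRI is assumed to cut after the first base of its recognition site GAATTC, and sequences are treated as linear.

```python
from collections import Counter
from typing import List, Sequence


def digest(sequence: str, site: str, cut_offset: int = 0) -> List[int]:
    """Sorted fragment lengths after cutting a linear sequence at every
    occurrence of `site`; the cut falls `cut_offset` bases into the site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]) if b > a)


def pattern_distance(p: Sequence[int], q: Sequence[int]) -> int:
    """ASSUMED distance: number of fragment sizes present in one pattern
    but not in the other (multiset symmetric difference)."""
    cp, cq = Counter(p), Counter(q)
    return sum(((cp - cq) + (cq - cp)).values())


def distance_matrix(patterns: Sequence[Sequence[int]]) -> List[List[int]]:
    """Symmetric matrix of pairwise pattern distances."""
    return [[pattern_distance(p, q) for q in patterns] for p in patterns]


def distinction_matrix(dist: Sequence[Sequence[int]], threshold: int) -> List[List[int]]:
    """0 (similar) if the distance is at most `threshold`, else 1 (dissimilar).
    The threshold value is an assumption of this sketch."""
    return [[0 if d <= threshold else 1 for d in row] for row in dist]


def nearest_neighbor_path(dist: Sequence[Sequence[int]], start: int = 0) -> List[int]:
    """Plain nearest neighbor heuristic for a Hamiltonian path:
    repeatedly move to the closest unvisited vertex (ties -> lowest index)."""
    path = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        last = path[-1]
        nxt = min(unvisited, key=lambda j: (dist[last][j], j))
        path.append(nxt)
        unvisited.remove(nxt)
    return path


def path_weight(dist: Sequence[Sequence[int]], path: Sequence[int]) -> int:
    """Total weight of consecutive steps along the path."""
    return sum(dist[a][b] for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    sequences = ["AAGAATTCAAAA", "AAGAATTCAAAAGAATTCA", "AAAAAAAAAAAA"]
    patterns = [digest(s, "GAATTC", cut_offset=1) for s in sequences]
    dist = distance_matrix(patterns)
    order = nearest_neighbor_path(dist)
    print(patterns, dist, order, path_weight(dist, order))
```

A heuristic of this kind gives no optimality guarantee; it only provides a cheap ordering against which other tour-construction or local search methods can be compared.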