Usefulness of Solution Algorithms of the Traveling Salesman Problem in the
Typing of Biological Sequences in a Clinical Laboratory Setting
Javier Garcés Eisele1, Carolina Yolanda Castañeda Roldán2,
Mauricio Osorio Galindo2, Ma. del Pilar Gómez Gil2
1 Universidad de las Américas, Puebla, Depto. de Química y Biología. CIQB.
jgarces@mail.udlap.mx
2 Universidad de las Américas, Depto. de Ing. en Sistemas Computacionales. CENTIA.
ccastane@mail.udlap.mx, josorio@mail.udlap.mx, pgomez@mail.udlap.mx
Abstract
Our concern is to solve the problem of the typing of
deoxyribonucleic acid (DNA) sequences in a
laboratory setting. Here we try to find solution
algorithms for the classification of restriction patterns,
which forms part of the above-mentioned problem, in
order to evaluate the amount of information generated
by a given restriction enzyme. A distance matrix is
generated by comparison of each restriction pattern
and used to classify the patterns according to their
similarity. This problem can be mapped to the
Traveling Salesman Problem (TSP). Several known
and new solution algorithms have been tested.
Interestingly, a very simple, modified nearest-neighbor
analysis performed best for this kind of
problem. However, when the distance matrix is
replaced by a “distinction matrix”, which uses a
threshold function to express directly the similarity (0)
or dissimilarity (1) between restriction patterns, the
results of at least one local search algorithm are
dramatically improved.
1. Introduction
For the TSP, we are given a complete, weighted
graph and we want to find a tour (a cycle through all
the vertices) of minimum weight [1]. One formal
definition of the TSP can be found in [2]. Interestingly,
several problems arising from the analysis of DNA
sequences can be formulated analogously to the TSP,
one of which will be presented and analyzed herein.
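As a minimal sketch of the TSP definition above, the following Python code computes the weight of a tour and builds a tour with the plain textbook nearest-neighbor heuristic. It is an illustration only, not the modified nearest-neighbor analysis evaluated in this paper; the function names `tour_weight` and `nearest_neighbor_tour` and the 4-vertex matrix `dist` are hypothetical.

```python
def tour_weight(dist, tour):
    """Sum the edge weights of a closed tour (returns to the first vertex)."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def nearest_neighbor_tour(dist, start=0):
    """Greedy heuristic: always move to the closest unvisited vertex."""
    unvisited = set(range(len(dist))) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        # Ties are broken by the smaller vertex index.
        nxt = min(unvisited, key=lambda v: (dist[last][v], v))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


# Hypothetical symmetric weight matrix for a complete graph on 4 vertices.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

tour = nearest_neighbor_tour(dist)
print(tour, tour_weight(dist, tour))  # [0, 1, 3, 2] 18
```

On this small instance the greedy tour happens to be optimal; in general the heuristic gives no such guarantee.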
DNA (deoxyribonucleic acid) is the genetic
material that encodes the characteristics of living
things. DNA consists of strings of molecules called
nucleotides. There are four nucleotides in DNA,
distinguished by their bases, each denoted by the first
letter of its base: adenine (A), cytosine (C), guanine
(G) and thymine (T) [3]. A DNA sequence can,
therefore, be treated as a character string using an
alphabet of 4 letters. The sequence of these letters
defines the characteristics of any living being; thus,
knowledge of the sequence, or at least part of it, allows
the identification of the organism to which the
sequence belongs. Different types of sequence
analysis can be employed in a clinical laboratory
setting in order to identify an infectious agent present
in a sample taken from a given patient. The instance
that will be treated is an example of the so-called
sequence-typing problem (STP) applied to the case of
the Human Papilloma Viruses (HPV), which are
associated with the development of cervical cancer [4].
The required sequence analysis may be performed by a
technique called RFLP-PCR (Restriction Fragment
Length Polymorphism coupled to Polymerase Chain
Reaction). Briefly, a segment of the viral genome is
analyzed with the help of so-called restriction
enzymes, which cut the segment wherever a specific short
substring occurs; e.g. the enzyme EcoRI recognizes
the substring GAATTC [5]. The pattern (sizes) of the
generated fragments is then determined; it is
a function of the sequence itself. The HPV
types may then be identified, as long as the
corresponding patterns generated by an enzyme are
different for each virus. Otherwise, combinations of
enzymes have to be used. Until now, 48 reference
sequences have been published and more than 180
restriction enzymes are available to perform the typing,
each recognizing a different subsequence or substring.
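The way a restriction pattern follows from the sequence can be sketched in a few lines of Python. This is an illustrative simplification, assuming a linear segment, a single enzyme, and exact matching of the recognition substring; the function name `fragment_sizes`, the parameter `cut_offset` and the toy sequence are hypothetical. The offset of 1 reflects that EcoRI cuts between the G and the first A of GAATTC.

```python
def fragment_sizes(sequence, site, cut_offset):
    """Return the sorted fragment lengths obtained by cutting a linear
    sequence at every occurrence of the recognition substring `site`.

    `cut_offset` is the cut position relative to the start of the site.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Fragment boundaries: both sequence ends plus every cut position.
    bounds = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


# Toy 20-letter sequence containing two EcoRI sites (not a real HPV sequence).
toy = "AA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
print(fragment_sizes(toy, "GAATTC", cut_offset=1))  # [3, 7, 10]
```

Two sequences that yield the same sorted list of fragment sizes cannot be told apart by that enzyme alone, which is why combinations of enzymes may be needed.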
In order to select an optimal combination of
enzymes to carry out the typing, it is important to
evaluate each enzyme, i.e. to determine how much
information it yields on average. In a
simple approach, this requires grouping the restriction patterns
according to their similarity, which means that we have
to determine the distance between each pair of them
and order them linearly according to their similarity.
This in turn yields a distance matrix from which we
have to select a Hamiltonian path or circuit of minimal
weight. Thus, we are confronted with a problem
similar to the TSP. The instances are symmetric but not
always geometric. However, due to the evolutionary
Proceedings of the 14th International Conference on Electronics, Communications and Computers (CONIELECOMP’04)
0-7695-2074-X/04 $ 20.00 © 2004 IEEE