ClustDB: A High-Performance Tool for Large Scale Sequence Matching
Jürgen Kleffe, Friedrich Möller and Burghardt Wittig
Charité Berlin, Institut für Molekularbiologie und Bioinformatik, Arnimallee 22, 14195 Berlin
E-Mail: juergen.kleffe@charite.de, friedrich.moeller@charite.de, burghardt.wittig@charite.de
Abstract
High throughput sampling of expressed sequence tags
(ESTs) has generated huge collections of transcripts that
are difficult to compare with each other using existing
tools for sequence matching. The major problem is lack
of computer memory. We therefore present a new exact
and memory efficient algorithm for the simultaneous
identification of matching substrings in large sets of
sequences. Applied to the more than six million human
ESTs in GenBank as of 2005-04-06, comprising more than
3.3 billion base pairs, it takes less than four hours to find
all of the more than seven million clusters of multiple
substrings of, for example, at least 50 nucleotides in
length on a standard PC with 2 GB of RAM and a 2.8 GHz
processor. The corresponding program ClustDB is able to
handle at least eight times more data than VMATCH, the
most memory efficient exact software known today. Our
program is freely available for academic use.
Contact: juergen.kleffe@charite.
1. Introduction
With a given sequence in mind, the series of BLAST
programs and their various improvements, SSAHA [1],
PatternHunter [2], BLAT [3], and PIERS [4] are
sufficient to search for matches in even large sets of
sequences. But many problems begin with first
identifying the sequences of interest. The algorithms used
in such cases can no longer afford to compare in turn each
candidate sequence with a large database. Using suffix
trees or suffix arrays, more efficient and exact methods of
simultaneous sequence comparison can quickly identify
perfectly matching pairs of substrings which are often
extended to sequence alignments with errors. But because
of its high memory consumption, the suffix tree approach
is known to fail in large-scale applications. The program
VMATCH [5] uses the most space-efficient virtual suffix
tree model. Its recent version can handle about 250 MB
of sequence using 2 GB of memory. Other methods like
REPFIND [6] and MUMMER3 [7] all require more than
eight bytes of RAM for each base pair of the sequences
under investigation. But in practice we need efficient
algorithms that can handle more base pairs of sequence
than there are bytes of computer memory. For instance,
expressed sequence tags (ESTs) form substantial
proportions of the sequence data stored in public
databases. Currently, the human EST division of
GenBank alone contains more than seven million ESTs
adding up to more than 4.2 GB of sequence. These data
exceed the amount of sequence stored in all human
genomic contigs, although ESTs are fragments of mRNAs,
which contribute less than two percent to the human
genome. The reason for this excess is that ESTs are
sampled under different conditions, from specific tissues,
developmental stages, or pathological states. They
provide important information about mutations,
post-transcriptional modification and alternative splicing that
is often revealed by perfectly and nearly perfectly
matching substrings. ESTs originating from alternatively
spliced mRNAs share long common substrings, and parts
of ESTs representing mRNA fragments of highly
expressed genes are observed in large copy numbers. The
scientific analysis of these data will only be successful if
we are able to efficiently compare very large sets of
highly redundant sequences.
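The memory argument above can be made concrete with a little arithmetic. The following is a minimal sketch and not part of ClustDB. It assumes decimal units (1 MB = 10^6 and 1 GB = 10^9 bytes or bases), and the helper names are hypothetical. It derives the per-base cost implied by the VMATCH figures quoted above (about 250 MB of sequence in 2 GB of memory) and the RAM an index at that cost would need for the two human EST collections mentioned in the text.

```python
# Back-of-the-envelope memory estimate for index-based sequence matching.
# Assumptions (not from the paper): decimal units and one byte-sized
# character per base; function names are purely illustrative.
MB = 10**6
GB = 10**9


def bytes_per_base(ram_bytes, sequence_bases):
    """RAM cost per base implied by handling `sequence_bases` in `ram_bytes`."""
    return ram_bytes / sequence_bases


def ram_needed_gb(sequence_bases, cost_bytes_per_base):
    """RAM in GB needed to index `sequence_bases` at the given per-base cost."""
    return sequence_bases * cost_bytes_per_base / GB


# VMATCH figures from the text: about 250 MB of sequence using 2 GB of memory.
vmatch_cost = bytes_per_base(2 * GB, 250 * MB)

# Human EST collections mentioned in the text, in bases.
est_2005_04_06 = 3_300_000_000   # "more than 3.3 billion base pairs"
est_current = 4_200_000_000      # "more than 4.2 GB of sequence"

print(f"implied cost: {vmatch_cost:.1f} bytes per base")
for label, bases in [("2005-04-06", est_2005_04_06), ("current", est_current)]:
    print(f"{label}: about {ram_needed_gb(bases, vmatch_cost):.1f} GB of RAM")
```

Under these assumptions the quoted VMATCH figures correspond to 8 bytes per base, so indexing either EST collection at that cost would need well over ten times the 2 GB of RAM of the PC described in the abstract.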
This paper reports applications to the sets of all
GenBank human ESTs dated 2005-04-06 and 2003-02-26,
and to the Arabidopsis thaliana ESTs dated 02.12.02.
2. Algorithm
2.1 Substring-clusters
Indexing of databases allows identifying in optimal
time all pairs of substrings matching over a given
minimal length. But the quadratic growth of the number
of such pairs with sequence length causes severe
computational problems. The concepts of maximal
matches, unique matches and super-maximal matches
were introduced to reduce output. We propose an
alternative linear-space representation of matching
substrings, called substring-clusters, and output a table
with three columns named cluster (c), sequence number
(s) and match position (p). Each cluster is formed by a
subset of triples (c, s, p) with the same value for c and
Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06)
0-7695-2641-1/06 $20.00 © 2006