ClustDB: A High-Performance Tool for Large Scale Sequence Matching

Jürgen Kleffe, Friedrich Möller and Burghardt Wittig
Charité Berlin, Institut für Molekularbiologie und Bioinformatik, Arnimallee 22, 14195 Berlin
E-Mail: juergen.kleffe@charite.de, friedrich.moeller@charite.de, burghardt.wittig@charite.de

Abstract

High throughput sampling of expressed sequence tags (ESTs) has generated huge collections of transcripts that are difficult to compare with each other using existing tools for sequence matching. The major problem is lack of computer memory. We therefore present a new exact and memory efficient algorithm for the simultaneous identification of matching substrings in large sets of sequences. Applied to the more than six million human ESTs in GenBank as of 2005-04-06, which comprise more than 3.3 billion base pairs, it takes less than four hours on a standard PC with 2 GB of RAM and a 2.8 GHz processor to find all of the more than seven million clusters of multiple substrings with a minimal length of, say, 50 nucleotides. The corresponding program ClustDB is able to handle at least eight times more data than VMATCH, the most memory efficient exact software known today. Our program is freely available for academic use. Contact: juergen.kleffe@charite.

1. Introduction

With a given sequence in mind, the series of BLAST programs and their various improvements, SSAHA [1], PatternHunter [2], BLAT [3], and PIERS [4], are sufficient to search for matches even in large sets of sequences. But many problems begin with first identifying the sequences of interest. The algorithms used in such cases can no longer afford to compare each candidate sequence in turn with a large database. Using suffix trees or suffix arrays, more efficient and exact methods of simultaneous sequence comparison can quickly identify perfectly matching pairs of substrings, which are often then extended to sequence alignments with errors.
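As a rough illustration of the suffix array idea, the following minimal sketch sorts all suffixes of a small concatenated sequence set and reports lexicographically adjacent suffixes that share a prefix of at least a given length. It is not the method of any of the cited programs: the function name `adjacent_suffix_matches`, the separator `#`, and the naive sort are assumptions chosen for readability, and the sort shown here is far too slow and memory hungry for real data.

```python
from bisect import bisect_right

SEP = "#"  # assumed separator; must not occur in any sequence


def adjacent_suffix_matches(seqs, min_len):
    """Report adjacent suffix-array entries sharing >= min_len characters.

    Returns sorted tuples ((seq_a, pos_a), (seq_b, pos_b), match_length).
    """
    text = SEP.join(seqs)

    # Start offset of every sequence inside the concatenated text.
    starts = []
    pos = 0
    for s in seqs:
        starts.append(pos)
        pos += len(s) + 1  # +1 for the separator

    def locate(i):
        # Map a text position back to (sequence number, position in sequence).
        k = bisect_right(starts, i) - 1
        return (k, i - starts[k])

    def lcp(i, j):
        # Length of the common prefix of two suffixes, stopping at a separator
        # so that no match runs across a sequence boundary.
        n = 0
        while (i + n < len(text) and j + n < len(text)
               and text[i + n] == text[j + n] and text[i + n] != SEP):
            n += 1
        return n

    # Naive suffix array: suffix start positions in lexicographic order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    matches = []
    for a, b in zip(sa, sa[1:]):
        n = lcp(a, b)
        if n >= min_len:
            lo, hi = sorted((a, b))
            matches.append((locate(lo), locate(hi), n))
    return sorted(matches)


print(adjacent_suffix_matches(["AAACGTAA", "TTACGTTT"], 4))
```

In this toy input the only shared substring of length four is ACGT, found at position 2 of both sequences. Sorting places suffixes with a common prefix next to each other, which is why a single pass over neighbouring entries is enough to find them.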
But because of its high memory consumption the suffix tree approach is known to fail in large scale applications. The program VMATCH [5] uses the most space efficient virtual suffix tree model. Its recent version can handle about 250 MB of sequence using 2 GB of memory. Other methods like REPFIND [6] and MUMMER3 [7] all require more than eight bytes of RAM for each base pair of the sequences under investigation. But in practice we need efficient algorithms that can handle more base pairs of sequence than there are bytes of computer memory.

For instance, expressed sequence tags (ESTs) form a substantial proportion of the sequence data stored in public databases. Currently the human EST division of GenBank alone contains more than seven million ESTs adding up to more than 4.2 GB of sequence. This exceeds the amount of sequence stored in all human genomic contigs, although ESTs are fragments of mRNAs, which account for less than two percent of the human genome. The reason for this excess is that ESTs are sampled under different conditions, from specific tissues, developmental stages, or pathological states. They provide important information about mutations, post-transcriptional modification and alternative splicing that is often revealed by perfectly and nearly perfectly matching substrings. ESTs originating from alternatively spliced mRNAs share long common substrings, and parts of ESTs representing mRNA fragments of highly expressed genes are observed in large copy numbers. The scientific analysis of these data will only be successful if we are able to efficiently compare very large sets of highly redundant sequences. This paper reports applications to the sets of all GenBank human ESTs dated 2005-04-06 and 2003-02-26, and to the Arabidopsis thaliana ESTs dated 02.12.02.

2. Algorithm

2.1 Substring-clusters

Indexing of databases allows identifying in optimal time all pairs of substrings matching over a given minimal length.
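To make the relation between an index and the matching pairs concrete, the following minimal sketch builds a simple index of fixed-length windows and counts both the indexed occurrences and the pairs they imply. It is a simplification and not the ClustDB implementation: it matches windows of exactly length `k` rather than substrings of at least a minimal length, and the names `kmer_index` and `count_occurrences_and_pairs` are hypothetical.

```python
from collections import defaultdict


def kmer_index(seqs, k):
    """Map every length-k window to its (sequence number, position) hits."""
    index = defaultdict(list)
    for s, seq in enumerate(seqs):
        for p in range(len(seq) - k + 1):
            index[seq[p:p + k]].append((s, p))
    return index


def count_occurrences_and_pairs(seqs, k):
    """Count occurrences of repeated windows and the pairs they form."""
    occurrences = 0
    pairs = 0
    for hits in kmer_index(seqs, k).values():
        m = len(hits)
        if m > 1:
            occurrences += m            # one entry per occurrence
            pairs += m * (m - 1) // 2   # every two occurrences form a pair
    return occurrences, pairs


# Toy stand-in for a highly expressed gene sampled in n copies.
for n in (2, 4, 8):
    print(n, count_occurrences_and_pairs(["ACGTTGCA"] * n, 4))
```

With n identical copies of a sequence containing five distinct windows, the number of occurrences is 5n while the number of pairs is 5n(n-1)/2, so doubling the copy number roughly quadruples the pairs.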
But the quadratic growth of the number of such pairs with sequence length causes severe computational problems. The concepts of maximal matches, unique matches and super-maximal matches were introduced to reduce output. We propose an alternative linear space representation of matching substrings, called substring-clusters, and output a table with three columns named cluster (c), sequence number (s) and match position (p). Each cluster is formed by a subset of triples (c, s, p) with the same value for c and

Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06) 0-7695-2641-1/06 $20.00 © 2006