An investigation of software for minisatellite detection

Themba Masombuka
School of Computing, University of South Africa, Pretoria, 0003, South Africa
Email: masomkt@unisa.ac.za

Corne de Ridder
School of Computing, University of South Africa, Pretoria, 0003, South Africa
Email: driddc@unisa.ac.za

Derrick Kourie
Fastar / Espresso Research Group, Department of Computer Science, University of Pretoria, South Africa
Email: dkourie@up.ac.za

Abstract—A tandem repeat is a special kind of subsequence in a DNA sequence. It is characterised by an introductory sequence of nucleotides, called its motif, which is followed by several contiguous copies of the motif. These copies may be either exact or approximate. A minisatellite is a tandem repeat whose motif length is within a certain prespecified range. This paper investigates software that searches for minisatellites. Four prominent publicly available software packages have been run on sample data. They employ different search algorithms, their notions of approximate matching differ, and the way in which search parameters are set varies. The results are studied and compared. Several significant differences in the output of the various packages are reported.

I. INTRODUCTION

Repetitive DNA sub-sequences are relevant in biology for various reasons, including gene variation and regulatory functions on gene expression. A tandem repeat consists of two or more contiguous copies of a nucleotide sequence [1]; the repeated sequence is called the motif. Consider the sequence ACGTCGACGTCGACGTCGTT. Here the motif ACGTCG, of length 6, is repeated 3 times, as in ACGTCG ACGTCG ACGTCG TT. The motif CGTCGA is also repeated 3 times, allowing for one mutation, as in A CGTCGA CGTCGA CGTCGT T. As can be seen, the last sub-string is not an exact copy of the motif.
In the former case the string is referred to as a perfect tandem repeat (PTR); in the latter it is referred to as an approximate tandem repeat (ATR). Thus, in the case of an ATR, the so-called motif, or perfect tandem repeat element (PTRE), is followed by one or more approximate tandem repeat elements (ATREs). ATREs are not exact copies of the motif¹.

¹ Approximate copies are generally ascribed to one of three errors: a mismatch, an insertion or a deletion. In the case of a mismatch, a nucleotide other than the expected one appears in a given position; in the case of an insertion, a nucleotide is unexpectedly inserted ahead of another one; and in the case of a deletion, a nucleotide is absent from the string in its expected position. Some authors use the term "indel" to refer to either an insertion or a deletion.

Biologists distinguish between three types of tandem repeats (TRs), namely microsatellites, minisatellites and satellites. The three types differ in the length of their consensus motif. Microsatellites have a motif length of 2 ≤ |motif| ≤ 5, minisatellites have a motif length of 5 < |motif| ≤ 100, and satellites have a motif length of |motif| > 100. However, there are some inconsistencies with regard to this classification based on motif size. Delgrange and Rivals [2], Benson [3] and De Ridder et al. [4] agree on the above classification. Thurson and Field [5] classify microsatellites as having a motif size of less than seven; hence, in their scheme, minisatellites have 7 ≤ |motif| ≤ 100. For the purposes of this paper, the minisatellite motif size is taken to be 5 < |motif| ≤ 100.

The detection of minisatellites can contribute to the development of DNA fingerprinting, which has been used in forensic medicine, paternity testing and population genetics [6]–[8]. Minisatellites have also been associated with certain human diseases such as epilepsy and diabetes [9], [10].
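To make the PTR/ATR distinction and the motif-length classes concrete, the following Python fragment offers a minimal sketch. It is not taken from any of the packages studied here. The function names (`hamming`, `describe_repeat`, `classify_by_motif_length`) are hypothetical, and the sketch assumes that approximate copies differ from the motif by mismatches only, ignoring insertions and deletions.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def describe_repeat(sequence, start, motif_len, copies):
    """Treat the first window as the motif and compare later windows to it.

    Assumption: copies are aligned back to back with no indels, so each
    copy occupies exactly motif_len positions.
    """
    end = start + motif_len * copies
    if start < 0 or motif_len <= 0 or copies < 1 or end > len(sequence):
        raise ValueError("repeat does not fit inside the sequence")
    windows = [sequence[start + i * motif_len:start + (i + 1) * motif_len]
               for i in range(copies)]
    motif = windows[0]
    # Mismatch count of every copy that follows the motif itself.
    mismatches = [hamming(motif, w) for w in windows[1:]]
    kind = "PTR" if all(m == 0 for m in mismatches) else "ATR"
    return motif, mismatches, kind


def classify_by_motif_length(motif_len):
    """Apply the size classes adopted in this paper."""
    if motif_len < 2:
        raise ValueError("motif length must be at least 2")
    if motif_len <= 5:
        return "microsatellite"
    if motif_len <= 100:
        return "minisatellite"
    return "satellite"


if __name__ == "__main__":
    seq = "ACGTCGACGTCGACGTCGTT"
    print(describe_repeat(seq, 0, 6, 3))   # motif ACGTCG read from position 0
    print(describe_repeat(seq, 1, 6, 3))   # motif CGTCGA read from position 1
    print(classify_by_motif_length(6))
```

Applied to the example sequence from the Introduction, reading from position 0 yields a perfect repeat of ACGTCG, while reading from position 1 yields CGTCGA with a single mismatch in the third copy. Both motifs have length 6 and so fall in the minisatellite range used in this paper.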
The present study represents a starting point in a larger project whose intention is to improve on minisatellite detection algorithms. It is clearly important to establish the strengths and weaknesses of existing packages in terms of usability, data generated, accuracy, etc. This paper reports our findings with regard to data generated. Further research is needed in order to investigate the accuracy of implications. Our usability findings will be reported elsewhere.

Four software packages that detect minisatellites are investigated, namely Mreps [11], Phobos [12], TRF [3] and ATRHunter [13]. All of these packages are freely available on the web. Apart from Phobos, they have all been reported in the literature. Although this is not an exhaustive set of available packages, it was nevertheless considered sufficiently representative for our purposes. Note that TRF appears to be the package of choice for benchmarking purposes [4], [11], [13].

The remainder of the paper is laid out as follows. In Section II we provide an overview of the algorithmic details of software that searches for minisatellites. Section III reports on the TRs detected by the respective software packages; the data generated by the different packages is compared and reported on.

II. ALGORITHMS

The algorithms behind the identification of minisatellites can be classified as library based [4], [14] and ab initio [14]. Library-based techniques identify repetitive sequences by comparing an input to a set of known repeats in a database. An example of a library-based algorithm is RepeatMasker [15].
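The library-based idea can be illustrated with a deliberately simplified Python sketch. It is a toy under stated assumptions and does not describe how RepeatMasker or any other real tool works. The "database" is a plain dictionary of hypothetical motif names and sequences, matching is exact only, and the function names `count_tandem_copies` and `library_scan` are invented for illustration.

```python
def count_tandem_copies(sequence, motif, start):
    """Count exact back-to-back copies of motif beginning at start."""
    count = 0
    pos = start
    while sequence.startswith(motif, pos):
        count += 1
        pos += len(motif)
    return count


def library_scan(sequence, library, min_copies=2):
    """Report (name, position, copies) for each known motif found in tandem.

    Assumption: library maps a hypothetical repeat name to its motif, and
    only exact copies are recognised (no mismatches or indels).
    """
    hits = []
    for name, motif in library.items():
        if not motif:
            raise ValueError("library motifs must be non-empty")
        pos = 0
        while True:
            pos = sequence.find(motif, pos)
            if pos == -1:
                break
            copies = count_tandem_copies(sequence, motif, pos)
            if copies >= min_copies:
                hits.append((name, pos, copies))
            # Skip past the run just examined before searching again.
            pos += copies * len(motif)
    return hits


if __name__ == "__main__":
    known_repeats = {"rep1": "ACGTCG", "rep2": "GATTACA"}  # hypothetical entries
    print(library_scan("TTACGTCGACGTCGACGTCGTT", known_repeats))
```

The sketch highlights the defining limitation of the library-based approach: a repeat is reported only if its motif is already present in the library, so a previously unknown motif goes undetected.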