Indexing Highly Repetitive Collections

Gonzalo Navarro
Dept. of Computer Science, University of Chile. gnavarro@dcc.uchile.cl

Abstract. The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.

1 Introduction

After a years-long race to sequence the first human genome in the early 2000s, sequencing has become a routine activity that costs a few thousand dollars¹, and large sequencing companies are producing thousands of genomes per day². Maintaining databases of millions of genomes will be a real possibility very soon. With a storage requirement of about 715 MB per human genome (about 3 × 10^9 bases using 2 bits each), storing, say, one million genomes is perfectly realistic (around 700 TB). However, sequence analysis tools require indexed access to the data, where one can carry out pattern searches and mining. Such indexes require at the very least 50 bits per base (one pointer), raising the 700 TB to more than 16 PB (petabytes). Those sizes, especially if we require fast indexed access to the data, exceed today’s technological possibilities within reasonable cost.

What makes this challenge affordable is that most of those genomes (if we assume they belong to the same species, say human) are very similar to each other: 99.99% similar according to typical figures (although there is some debate about the exact value). If we were able to index such a collection within space proportional to the number of differences between the genomes, and not to their total size, then a one-million genome collection could be indexed within 1.6 TB, which is perfectly feasible. Yet, we still do not know how to do this.
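
The storage figures above follow from simple arithmetic, which can be checked with a short script. The sketch below is only a back-of-the-envelope illustration: the function name `storage_estimates` and its parameter names are ours, and it assumes binary units (1 MiB = 2^20 bytes, and so on), an assumption made here because it reproduces the quoted 715 MB per genome. The fraction of differing bases (1 in 10,000) is taken from the 99.99% similarity figure.

```python
def storage_estimates(bases_per_genome=3 * 10**9,
                      raw_bits_per_base=2,
                      index_bits_per_base=50,
                      num_genomes=10**6,
                      diff_fraction=1e-4):
    """Rough space estimates for a genome collection (illustrative only).

    All defaults are the figures quoted in the text; binary units are an
    assumption of this sketch, not something the text states.
    """
    # Plain sequence storage: 2 bits per base.
    genome_bytes = bases_per_genome * raw_bits_per_base / 8
    collection_bytes = genome_bytes * num_genomes

    # Classical index: at least one pointer (about 50 bits) per base.
    index_bytes = bases_per_genome * index_bits_per_base / 8 * num_genomes

    # Hypothetical index whose size is proportional to the differences only.
    diff_index_bytes = index_bytes * diff_fraction

    return {
        "genome_mib": genome_bytes / 2**20,
        "collection_tib": collection_bytes / 2**40,
        "index_pib": index_bytes / 2**50,
        "diff_index_tib": diff_index_bytes / 2**40,
    }


if __name__ == "__main__":
    for name, value in storage_estimates().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the script gives roughly 715 MiB per genome, 682 TiB for the plain collection, 16.7 PiB for a pointer-based index, and 1.7 TiB for a difference-proportional index, in line with the rounded figures quoted above.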
Other scenarios where a set of very similar sequences is indexed and searched for patterns include software repositories (where versions form a tree or graph structure) and versioned document collections (such as Wikipedia or the Internet Wayback Machine, where versions have a linear structure).

In this short survey we will cover the results achieved on the challenge of storing and indexing those “highly repetitive” sequence collections. We will focus on the following scenario, looking for a balance of generality and simplicity that leads to both useful and algorithmically interesting research:

1 The Guardian, Jan 12, 2012, “Company announces low-cost DNA decoding machine”.
2 The New York Times, Jan 12, 2011, “DNA sequencing caught in deluge of data”.