BIOINFORMATICS HITSEQ PAPER
Vol. 26 no. 10 2010, pages 1291–1298
doi:10.1093/bioinformatics/btq153

Sequence analysis
Advance Access publication April 8, 2010

Structural variation analysis with strobe reads

Anna Ritz 1, Ali Bashir 2 and Benjamin J. Raphael 1,3
1 Department of Computer Science, Brown University, Providence, RI 02912, 2 Pacific Biosciences, 1505 Adams Drive, Menlo Park, CA 94025 and 3 Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA

Associate Editor: Alex Bateman

ABSTRACT
Motivation: Structural variation, including deletions, duplications and rearrangements of DNA sequence, is an important contributor to genome variation in many organisms. In humans, many structural variants are found in complex and highly repetitive regions of the genome, making their identification difficult. A new sequencing technology called strobe sequencing generates strobe reads containing multiple subreads from a single contiguous fragment of DNA. Strobe reads thus generalize the concept of paired reads, or mate pairs, that have been routinely used for structural variant detection. Strobe sequencing holds promise for unraveling complex variants that have been difficult to characterize with current sequencing technologies.
Results: We introduce an algorithm for identification of structural variants using strobe sequencing data. We consider strobe reads from a test genome that have multiple possible alignments to a reference genome due to sequencing errors and/or repetitive sequences in the reference. We formulate the combinatorial optimization problem of finding the minimum number of structural variants in the test genome that are consistent with these alignments. We solve this problem using an integer linear program. Using simulated strobe sequencing data, we show that our algorithm has better sensitivity and specificity than paired read approaches for structural variation identification.
Contact: braphael@brown.edu
Received on March 4, 2010; revised on April 2, 2010; accepted on April 5, 2010

To whom correspondence should be addressed.
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

1 INTRODUCTION
Identifying the DNA sequence differences that distinguish individuals is a major challenge in genetics. Recent whole-genome sequencing and microarray measurements have shown that copy number variants (insertions, duplications and deletions) and balanced rearrangements, such as inversions and translocations, are common in most organisms including human (Sharp et al., 2006), mouse (Egan et al., 2007), fly (Dopman and Hartl, 2007) and yeast (Faddah et al., 2009). These larger differences in DNA sequences are commonly referred to as structural variants. The Database of Genomic Variants (Iafrate et al., 2004) currently (winter 2010) lists nearly 30 000 copy number variants and nearly 900 inversion variants in the human genome. Although some of these variants are redundant and/or erroneous, it is clear that structural variation is an important component of human genome variation. In fact, more total base pairs in the human genome are affected by structural variants than by single nucleotide polymorphisms (SNPs; Redon et al., 2006). Common and inherited structural variants, as well as de novo structural variants, have recently been linked to a number of human diseases (Girirajan et al., 2010; Greenway et al., 2009; Marshall et al., 2008). Moreover, somatic structural variants are common in cancer genomes and lead to altered regulation of oncogenes and tumor suppressor genes (Albertson et al., 2003) and the creation of novel fusion genes (Mitelman et al., 2004).

Much of the recent excitement surrounding structural variation stems from better measurement technologies.
In particular, End Sequence Profiling (Raphael et al., 2003; Volik et al., 2003), also known as paired read mapping (Korbel et al., 2007; Tuzun et al., 2005), has been used to identify structural variants in both normal and cancer genomes. In paired read mapping, DNA fragments from a test genome are sequenced from both ends, and these sequences (reads) are mapped to a reference genome. Paired reads, or mate pairs, with discordant alignments identify inversions, translocations, transpositions, insertions, deletions and other rearrangements that distinguish the test genome from the reference genome. A number of methods have been introduced to identify structural variants from paired read sequencing data (Bashir et al., 2008; Chen et al., 2009; Hormozdiari et al., 2009; Korbel et al., 2009; Lee et al., 2008; Quinlan et al., 2010; Raphael et al., 2003).

Structural variants vary widely in size and complexity, and are more difficult to characterize than SNPs. Many are associated with repeated sequences in the genome (Korbel et al., 2007), complicating their detection and characterization. In extreme cases, the variants themselves have highly repetitive or complex organization relative to the reference genome. For example, different lists of variants have been identified in the same individual using older clone-based sequencing (Kidd et al., 2008) and various next-generation sequencing platforms (Bentley et al., 2008; Hormozdiari et al., 2009; Korbel et al., 2007). Characterizing these complex variants requires longer reads, longer fragments, or both.

Pacific Biosciences recently demonstrated strobe sequencing technology (Turner, 2009). A strobe read, or strobe, consists of multiple subreads from a single contiguous fragment of DNA. These subreads are separated by a number of ‘dark’ nucleotides (called advances), whose identity is unknown (Fig. 1).
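The notions of subreads, advances and discordant alignments can be made concrete with a small sketch. The code below is illustrative only and is not the algorithm introduced in this paper: the names `SubreadAlignment`, `is_concordant` and `discordant_pairs`, the 0-based half-open coordinates, the advance bounds, and the assumption that consecutive subreads of a strobe map to the same strand are all hypothetical choices made for the example. It flags consecutive subread pairs whose reference alignments are inconsistent with the expected advance length, in the same spirit as discordant paired reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubreadAlignment:
    """One subread aligned to the reference (hypothetical representation)."""
    chrom: str
    start: int   # 0-based start on the reference
    end: int     # exclusive end on the reference
    strand: str  # '+' or '-'


def is_concordant(a: SubreadAlignment, b: SubreadAlignment,
                  min_advance: int, max_advance: int) -> bool:
    """Return True if consecutive subreads a, b look like the reference.

    Assumption: subreads of one strobe come from the same strand of one
    fragment, so a concordant pair shares chromosome and strand, keeps its
    order, and is separated by a gap within the expected advance range.
    """
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    if a.strand == '+':
        gap = b.start - a.end   # b lies downstream of a
    else:
        gap = a.start - b.end   # on '-', b lies at lower coordinates
    return min_advance <= gap <= max_advance


def discordant_pairs(strobe: List[SubreadAlignment],
                     min_advance: int, max_advance: int) -> List[int]:
    """Indices i such that subreads i and i+1 are discordant."""
    return [i for i in range(len(strobe) - 1)
            if not is_concordant(strobe[i], strobe[i + 1],
                                 min_advance, max_advance)]


# A strobe with three subreads; the second advance spans far more
# reference sequence than expected, as a deletion in the test genome would.
strobe = [
    SubreadAlignment("chr1", 1000, 1100, "+"),
    SubreadAlignment("chr1", 2000, 2100, "+"),
    SubreadAlignment("chr1", 8000, 8100, "+"),
]
flagged = discordant_pairs(strobe, min_advance=500, max_advance=1500)
print(flagged)
```

With these made-up coordinates, only the second pair of subreads is flagged. A strobe with exactly two subreads reduces to a single such check, which is the paired read case.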
A strobe with two subreads is analogous to a paired read, while strobes with more than two subreads provide additional information for structural variant detection. Thus far, Pacific Biosciences has demonstrated strobe reads with lengths up to 10 kb with 2–4 subreads, each of 50–200 bp. Additional improvements are expected as the technology matures.

© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org