BIOINFORMATICS HITSEQ PAPER
Vol. 26 no. 10 2010, pages 1291–1298
doi:10.1093/bioinformatics/btq153
Sequence analysis
Advance Access publication April 8, 2010
Structural variation analysis with strobe reads
Anna Ritz1,†, Ali Bashir2,† and Benjamin J. Raphael1,3,∗
1Department of Computer Science, Brown University, Providence, RI 02912,
2Pacific Biosciences, 1505 Adams Drive, Menlo Park, CA 94025 and
3Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA
Associate Editor: Alex Bateman
ABSTRACT
Motivation: Structural variation, including deletions, duplications and
rearrangements of DNA sequence, is an important contributor to
genome variation in many organisms. In humans, many structural
variants are found in complex and highly repetitive regions of
the genome, making their identification difficult. A new sequencing
technology called strobe sequencing generates strobe reads
containing multiple subreads from a single contiguous fragment
of DNA. Strobe reads thus generalize the concept of paired
reads, or mate pairs, that have been routinely used for structural
variant detection. Strobe sequencing holds promise for unraveling
complex variants that have been difficult to characterize with current
sequencing technologies.
Results: We introduce an algorithm for identification of structural
variants using strobe sequencing data. We consider strobe reads
from a test genome that have multiple possible alignments to
a reference genome due to sequencing errors and/or repetitive
sequences in the reference. We formulate the combinatorial
optimization problem of finding the minimum number of structural
variants in the test genome that are consistent with these alignments.
We solve this problem using an integer linear program. Using
simulated strobe sequencing data, we show that our algorithm has
better sensitivity and specificity than paired read approaches for
structural variation identification.
Contact: braphael@brown.edu
Received on March 4, 2010; revised on April 2, 2010; accepted on
April 5, 2010
1 INTRODUCTION
Identifying the DNA sequence differences that distinguish
individuals is a major challenge in genetics. Recent whole-
genome sequencing and microarray measurements have shown that
copy number variants (insertions, duplications and deletions) and
balanced rearrangements, such as inversions and translocations, are
common in most organisms including human (Sharp et al., 2006),
mouse (Egan et al., 2007), fly (Dopman and Hartl, 2007) and yeast
(Faddah et al., 2009). These larger differences in DNA sequences
are commonly referred to as structural variants. The Database of
Genomic Variants (Iafrate et al., 2004) currently (winter 2010)
lists nearly 30 000 copy number variants and nearly 900 inversion
variants in the human genome. Although some of these variants are
redundant and/or erroneous, it is clear that structural variation is
an important component of human genome variation. In fact, more
total base pairs in the human genome are affected by structural
variants than by single nucleotide polymorphisms (SNPs; Redon et al.,
2006). Common and inherited structural variants, as well as de novo
structural variants, have recently been linked to a number of human
diseases (Girirajan et al., 2010; Greenway et al., 2009; Marshall
et al., 2008). Moreover, somatic structural variants are common in
cancer genomes and lead to altered regulation of oncogenes and
tumor suppressor genes (Albertson et al., 2003) and the creation of
novel fusion genes (Mitelman et al., 2004).
∗To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.
Much of the recent excitement surrounding structural variation
stems from better measurement technologies. In particular, End
Sequence profiling (Raphael et al., 2003; Volik et al., 2003), also
known as paired read mapping (Korbel et al., 2007; Tuzun et al.,
2005), has been used to identify structural variants in both normal
and cancer genomes. In paired read mapping, DNA fragments from
a test genome are sequenced from both ends, and these sequences
(reads) are mapped to a reference genome. Paired reads, or mate
pairs, with discordant alignments identify inversions, translocations,
transpositions, insertions, deletions and other rearrangements that
distinguish the test genome from the reference genome. A number
of methods have been introduced to identify structural variants from
paired read sequencing data (Bashir et al., 2008; Chen et al., 2009;
Hormozdiari et al., 2009; Korbel et al., 2009; Lee et al., 2008;
Quinlan et al., 2010; Raphael et al., 2003).
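As a minimal sketch of the paired read mapping idea described above, the following Python snippet labels a single mapped pair as concordant or as carrying a discordant signature. The class and function names, the orientation convention (a concordant pair maps as +/-), and the fragment-length bounds l_min and l_max are illustrative assumptions and are not taken from any of the cited methods, which typically cluster many pairs before calling a variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedRead:
    chrom: str   # reference chromosome of the alignment
    pos: int     # leftmost reference coordinate of the alignment
    strand: str  # '+' or '-'


def classify_pair(left: MappedRead, right: MappedRead,
                  l_min: int, l_max: int) -> str:
    """Return a coarse label for one mapped pair.

    Assumption: in a concordant pair both ends map to the same
    chromosome in convergent (+, -) orientation, with a mapped span
    between l_min and l_max (hypothetical fragment-length bounds).
    Each non-concordant label names a candidate signature only.
    """
    if left.chrom != right.chrom:
        return "translocation"
    # Order the two ends by reference coordinate.
    a, b = sorted((left, right), key=lambda r: r.pos)
    if a.strand == b.strand:
        return "inversion"
    if (a.strand, b.strand) != ("+", "-"):
        return "everted"  # e.g. consistent with a tandem duplication
    span = b.pos - a.pos
    if span > l_max:
        return "deletion"   # ends map farther apart than expected
    if span < l_min:
        return "insertion"  # ends map closer together than expected
    return "concordant"
```

A pair whose ends map 5 kb apart in a library with 1.5 to 2.5 kb fragments would be labeled "deletion" here; a real caller would require several such pairs supporting the same breakpoints.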
Structural variants vary widely in size and complexity, and are
more difficult to characterize than SNPs. Many are associated
with repeated sequences in the genome (Korbel et al., 2007),
complicating their detection and characterization. In extreme
cases, the variants themselves have highly repetitive or complex
organization relative to the reference genome. For example, different
lists of variants have been identified in the same individual using
older clone-based sequencing (Kidd et al., 2008) and various next-
generation sequencing platforms (Bentley et al., 2008; Hormozdiari
et al., 2009; Korbel et al., 2007). Characterizing these complex
variants requires longer reads, longer fragments, or both.
Pacific Biosciences recently demonstrated strobe sequencing
technology (Turner, 2009). A strobe read, or strobe, consists of
multiple subreads from a single contiguous fragment of DNA. These
subreads are separated by a number of ‘dark’ nucleotides (called
advances), whose identity is unknown (Fig. 1). A strobe with two
subreads is analogous to a paired read, while strobes with more than
two subreads provide additional information for structural variant
detection. Thus far, Pacific Biosciences has demonstrated strobe
reads with lengths up to 10 kb with 2–4 subreads each of 50–200 bp.
Additional improvements are expected as the technology matures.
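The structure of a strobe read described above can be sketched as a small Python data type. This is an illustration only: the class name Strobe, its field names, and the simplification that each advance length is known exactly are assumptions made for the example, not a description of the Pacific Biosciences data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Strobe:
    subreads: Tuple[str, ...]  # sequenced bases of each subread, in order
    advances: Tuple[int, ...]  # dark nucleotides between consecutive subreads

    def __post_init__(self):
        # A strobe has multiple subreads, with one advance between each pair.
        if len(self.subreads) < 2:
            raise ValueError("a strobe needs at least two subreads")
        if len(self.advances) != len(self.subreads) - 1:
            raise ValueError("need exactly one advance between subreads")

    def fragment_length(self) -> int:
        """Total length spanned: all subread bases plus all advances."""
        return sum(len(s) for s in self.subreads) + sum(self.advances)

    def subread_offsets(self) -> List[int]:
        """Start offset of each subread within the fragment."""
        offsets, cursor = [], 0
        for i, s in enumerate(self.subreads):
            offsets.append(cursor)
            cursor += len(s)
            if i < len(self.advances):
                cursor += self.advances[i]
        return offsets

    def is_paired_read_like(self) -> bool:
        """A strobe with exactly two subreads is analogous to a paired read."""
        return len(self.subreads) == 2
```

With three subreads, each pair of consecutive subreads constrains the reference in the same way a paired read does, which is the sense in which strobes generalize mate pairs.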
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org