INFORMATICS OFFICIAL JOURNAL www.hgvs.org

AmpliVar: Mutation Detection in High-Throughput Sequence from Amplicon-Based Libraries

Arthur L. Hsu,1∗‡ Olga Kondrashova,1‡ Sebastian Lunke,1 Clare J. Love,1 Cliff Meldrum,2 Renate Marquis-Nicholson,1 Greg Corboy,1 Kym Pham,1 Matthew Wakefield,3 Paul M. Waring,1 and Graham R. Taylor1†

1 Department of Pathology, The University of Melbourne, Parkville, Victoria, Australia; 2 Molecular Genetics Laboratory, Pathology North, Newcastle, New South Wales, Australia; 3 Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia

Communicated by Mark H. Paalman
Received 7 August 2014; accepted revised manuscript 21 January 2015.
Published online 9 February 2015 in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22763

ABSTRACT: Conventional means of identifying variants in high-throughput sequencing align each read against a reference sequence, and then call variants at each position. Here, we demonstrate an orthogonal means of identifying sequence variation by grouping the reads as amplicons prior to any alignment. We used AmpliVar to make key-value hashes of sequence reads and group reads as individual amplicons using a table of flanking sequences. Low-abundance reads were removed according to a selectable threshold, and reads above this threshold were aligned as groups, rather than as individual reads, permitting the use of sensitive alignment tools. We show that this approach is more sensitive, more specific, and more computationally efficient than comparable methods for the analysis of amplicon-based high-throughput sequencing data. The method can be extended to enable alignment-free confirmation of variants seen in hybridization capture target-enrichment data.
Hum Mutat 36:411–418, 2015. © 2015 Wiley Periodicals, Inc.
KEY WORDS: next generation sequencing; amplicon sequencing; mutation detection; grouped reads

Additional Supporting Information may be found in the online version of this article.
† Correspondence to: Graham R Taylor. Email: graham.taylor@unimelb.edu.au
∗ Correspondence may also be addressed to: Arthur L. Hsu. Email: alhsu@unimelb.edu.au
‡ These authors contributed equally to this work.
Contract Grant Sponsors: the National Health and Medical Research Council (NHMRC, APP1025879 and APP1029974); the Victorian Cancer Agency; the Victorian Comprehensive Cancer Centre and the Education Investment Fund of the Australian Government’s Super Science Initiative; the Herman Trust.

Introduction

Alignment of reads to a reference genome is an early step in a typical data analysis pathway for clonal sequencing, followed by variant calling against the reference sequence and finally annotation of the variants. This approach requires the alignment of millions of individual reads per sample, and places a high computational burden on alignment software. As such, short read alignment algorithms such as BWA [Li and Durbin, 2010] and Bowtie [Langmead et al., 2009] place greater emphasis on speed than on accuracy when compared with conventional alignment methods such as BLAST [Altschul et al., 1990], BLAT [Kent, 2002], and Smith–Waterman [Smith and Waterman, 1981]. An alternative approach would be to reduce the number of input reads, so that fewer reads need to be aligned. If the reads are derived from amplicons, then multiple identical reads are to be expected, and this can be exploited at an early stage of the analysis.

In applications that require highly specific selection of target regions, or high sequencing depth, amplicon-based methods may be favored for library construction over whole genome or in-solution hybridization capture target-enrichment methods.
Sequencing amplicon products is a special case, because each amplicon can be treated as a set of reads defined by its PCR primers. Furthermore, the reads are already effectively mapped, since the loci of the primers are known by design. Read groups within each amplicon correspond to a mixture of wild-type sequence, true variants, and experimental artifacts. Grouping reads in each amplicon and sorting by abundance can remove unique or rare sequences that most likely correspond to artifacts. This should leave only a few abundant read types per amplicon, corresponding to wild-type alleles and variants, greatly simplifying the analysis of those variants.

While amplicon methods are a popular way to prepare targeted libraries, there are currently only a few analysis methods that are specifically designed for amplicon sequencing analysis, for example, TSSV [Anvar et al., 2014], Mutascope [Yost et al., 2013], and MiSeq Reporter Custom Amplicon Workflow (Illumina, Inc., San Diego, CA).

We used the primer sequences to identify and group the target reads as key-value pairs, where the key is the sequence and the value is the read count. Grouping reads in this way enables the read content of each amplicon to be interrogated individually, without prior alignment. Each sequence variant in a particular amplicon must correspond to a genuine variant, amplification error, sequencing error, or error due to ex vivo template modification. For higher multiplex amplicon sets, or to identify variants not defined in the genotype tables, the sequence variants are passed to BLAT as read groups for alignment, followed by the generation of BAM files, and then variant calling and annotation. Here, we demonstrate the performance of our method for evaluating the quality of amplicon data, as well as for variant detection, by comparing it with other available next-generation sequencing (NGS) data analysis methods.
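
The grouping step described above can be illustrated with a minimal Python sketch. This is not the AmpliVar implementation: the function name group_reads, the min_fraction parameter, the exact-match flank test, and the example sequences are all assumptions made for illustration. It counts identical reads as key-value pairs (sequence as key, read count as value), assigns each distinct sequence to an amplicon by its flanking sequences, and drops low-abundance groups below a selectable threshold.

```python
from collections import Counter, defaultdict


def group_reads(reads, flanks, min_fraction=0.01):
    """Group reads per amplicon and keep only abundant read groups.

    reads: iterable of read sequences (strings).
    flanks: dict mapping a hypothetical amplicon name to a
        (left_flank, right_flank) pair of sequences.
    min_fraction: illustrative abundance threshold, expressed as a
        fraction of the reads assigned to that amplicon.
    """
    # Key-value hash: the key is the sequence, the value is the read count.
    read_counts = Counter(reads)

    # Assign each distinct sequence to the first amplicon whose flanks match.
    # Exact prefix/suffix matching is a simplifying assumption.
    grouped = defaultdict(dict)
    for seq, count in read_counts.items():
        for name, (left, right) in flanks.items():
            if seq.startswith(left) and seq.endswith(right):
                grouped[name][seq] = count
                break

    # Remove low-abundance groups and sort the rest by descending count.
    kept = {}
    for name, groups in grouped.items():
        total = sum(groups.values())
        abundant = [(s, n) for s, n in groups.items() if n / total >= min_fraction]
        kept[name] = sorted(abundant, key=lambda item: item[1], reverse=True)
    return kept


if __name__ == "__main__":
    # Toy data: one amplicon with a wild-type group, a variant, and a rare read.
    example_reads = ["ACGTTTGGCA"] * 90 + ["ACGTATGGCA"] * 9 + ["ACGTCTGGCA"]
    example_flanks = {"amp1": ("ACGT", "GGCA")}
    for amplicon, groups in group_reads(example_reads, example_flanks, 0.05).items():
        print(amplicon, groups)
```

In this sketch only the surviving read groups, rather than every individual read, would be handed on to an aligner, which is what makes a slower but more sensitive alignment tool practical.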