ORIGINAL ARTICLE

Quality control metrics improve repeatability and reproducibility of single-nucleotide variants derived from whole-genome sequencing

W Zhang¹, V Soika², J Meehan¹, Z Su¹, W Ge¹, HW Ng¹, R Perkins¹, V Simonyan², W Tong¹ and H Hong¹

Although many quality control (QC) methods have been developed to improve the quality of single-nucleotide variants (SNVs) in SNV-calling, QC methods for use subsequent to single-nucleotide polymorphism-calling have not been reported. We developed five QC metrics to improve the quality of SNVs using the whole-genome-sequencing data of a monozygotic twin pair from the Korean Personal Genome Project. The QC metrics improved both repeatability between the monozygotic twin pair and reproducibility between SNV-calling pipelines. We demonstrated that the QC metrics improve reproducibility of SNVs derived from not only whole-genome-sequencing data but also whole-exome-sequencing data. The QC metrics are calculated based on the reference genome used in the alignment, without accessing the raw and intermediate data or knowing the SNV-calling details. Therefore, the QC metrics can be easily adopted in downstream association analysis.

The Pharmacogenomics Journal advance online publication, 11 November 2014; doi:10.1038/tpj.2014.70

INTRODUCTION

Associating genetic variations and changes in populations with phenotypic traits is the essential goal in genetic studies. Genetic variants in candidate genes or in a whole genome are explored to uncover the genetic variants associated with the phenotypes under study.¹⁻³ To conduct a genetic study, especially a genome-wide association study (GWAS), a set of all the possible genetic variants in the genome needs to be determined for all the subjects used in the study.
Massively parallel measurements are carried out for each subject (hereafter, the upstream analysis) before identifying which genetic variants are associated with the phenotypic traits under study (hereafter, the downstream analysis). The first GWAS was published in 2005, wherein a functional single-nucleotide polymorphism (SNP) in the complement factor H gene was inferred to be associated with age-related macular degeneration.⁴ Since then, GWAS has been widely applied to identify genetic variants associated with the risk of >200 diseases and human phenotypic traits⁵⁻¹² (http://www.genome.gov/gwastudies/). However, replication studies demonstrated that only a small portion of the associated loci in the initial GWAS could be replicated, even within the same populations.¹³,¹⁴ Not surprisingly, concerns arose regarding the reliability and usability of GWAS findings based on SNP array genotyping technologies.¹⁵⁻¹⁷

In addition to other factors such as case–control misclassification¹⁸ and non-genetic covariates,¹⁹ inaccurate genotyping data were also found to contribute to false associations.²⁰⁻²³ Simulations revealed that a very small discordance in genotypes could markedly change odds ratios of genetic markers, especially for genetic variants with low frequency, thereby deleteriously affecting the final conclusions of a GWAS.²⁴ Technologies and analysis methods that together generate accurate genotypes are vital for improving genetic study effectiveness and reliability. Beyond microarray technology, which has long been the mainstay,²⁵ next-generation sequencing (NGS)²⁶,²⁷ has emerged as the preferred high-throughput genotyping technology for genetic study.²⁸⁻³²

Upstream analysis using NGS is a complicated process comprising many steps,³³ including DNA fragmentation, sequencing, base-calling to generate the sequences of the DNA fragments (raw reads), mapping raw reads to a reference genome and determining single-nucleotide variants (SNVs). Each step can introduce errors that may affect the quality of SNVs passed to downstream analysis. Therefore, quality controls (QCs) for each step in upstream analysis have been developed to detect, prevent and mitigate errors and analysis biases, with the cumulative aim of improving the quality of SNVs derived from NGS data. Notable examples of QC interventions include those for library preparation,³⁴,³⁵ base-calling,³⁶ raw reads,³⁷ mapping³⁸ and SNV-calling.³⁹,⁴⁰

Despite the multiple QC strategies that have been applied in upstream analysis, the quality, in terms of accuracy, repeatability and reproducibility, of SNVs passed to downstream analysis remains deficient. These quality deficits are evidenced by recent findings that the three popular sequencing platforms (Roche/454, Illumina/HiSeq and Life Technologies/SOLiD) had SNV detection biases.⁴¹ Low concordances were also observed between the SNVs called using five popular SNV-calling pipelines on the same exome-sequencing data.⁴² The need for additional improvement in SNV quality for genetic studies suggests an opportunity for further QC interventions to be carried out subsequent to upstream analysis, but before or in concert with downstream analysis. To our knowledge, no such post-upstream

¹Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA and ²Office of The Center Director, Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, MD, USA.
Correspondence: Dr H Hong, Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA.
E-mail: huixiao.hong@fda.hhs.gov
Received 31 March 2014; revised 16 July 2014; accepted 19 September 2014

The Pharmacogenomics Journal (2014), 1–12
© 2014 Macmillan Publishers Limited All rights reserved 1470-269X/14
www.nature.com/tpj