ORIGINAL ARTICLE
Quality control metrics improve repeatability and
reproducibility of single-nucleotide variants derived
from whole-genome sequencing
W Zhang¹, V Soika², J Meehan¹, Z Su¹, W Ge¹, HW Ng¹, R Perkins¹, V Simonyan², W Tong¹ and H Hong¹
Although many quality control (QC) methods have been developed to improve the quality of single-nucleotide variants (SNVs) in
SNV-calling, QC methods for use subsequent to single-nucleotide polymorphism-calling have not been reported. We developed five
QC metrics to improve the quality of SNVs using the whole-genome-sequencing data of a monozygotic twin pair from the Korean
Personal Genome Project. The QC metrics improved both repeatability between the monozygotic twin pair and reproducibility
between SNV-calling pipelines. We demonstrated that the QC metrics improve reproducibility of SNVs derived not only from
whole-genome-sequencing data but also from whole-exome-sequencing data. The QC metrics are calculated based on the reference
genome used in the alignment without accessing the raw and intermediate data or knowing the SNV-calling details. Therefore,
the QC metrics can be easily adopted in downstream association analysis.
The Pharmacogenomics Journal advance online publication, 11 November 2014; doi:10.1038/tpj.2014.70
INTRODUCTION
Associating genetic variations and changes in populations with
phenotypic traits is the essential goal in genetic studies. Genetic
variants in candidate genes or in a whole genome are explored to
uncover the genetic variants associated with the phenotypes under
study.¹⁻³ To conduct a genetic study, especially a genome-wide
association study (GWAS), a set of all the possible genetic variants
in the genome needs to be determined for all the subjects used
in the study. Massively parallel measurements are carried out
for each subject (we hereafter refer to this part as the upstream
analysis) before identifying which genetic variants are
associated with the phenotypic traits in the study (we hereafter
refer to this part as the downstream analysis).
The first GWAS was published in 2005, wherein a functional
single-nucleotide polymorphism (SNP) in the complement factor H
was inferred to be associated with age-related macular
degeneration.⁴ Since then, GWAS has been widely applied to identify
genetic variants associated with the risk of >200 diseases and human
phenotypic traits⁵⁻¹² (http://www.genome.gov/gwastudies/). However,
replication studies demonstrated that only a small portion
of the associated loci in the initial GWAS could be replicated,
even within the same populations.¹³,¹⁴ Not surprisingly, concerns
arose regarding reliability and usability of GWAS findings based on
SNP array genotyping technologies.¹⁵⁻¹⁷
In addition to other factors such as case–control
misclassification¹⁸ and non-genetic covariates,¹⁹ inaccurate genotyping
data was also found to contribute to the false associations.²⁰⁻²³
Simulations revealed that a very small discordance in genotypes
could markedly change odds ratios of genetic markers, especially
for genetic variants with low frequency, therefore deleteriously
affecting the final conclusions of a GWAS.²⁴ Technologies and
analysis methods that together generate accurate genotypes are
vital for improving genetic study effectiveness and reliability.
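The sensitivity of odds ratios to genotype discordance can be illustrated with a small worked example. The sketch below is not the simulation cited above; the allele counts and the number of miscalled alleles are hypothetical values chosen only to show the arithmetic, and the function names are illustrative.

```python
def odds_ratio(case_minor, case_major, control_minor, control_major):
    """Allelic odds ratio from a 2x2 table of allele counts."""
    return (case_minor * control_major) / (case_major * control_minor)


def odds_ratio_after_miscalls(case_minor, case_major,
                              control_minor, control_major, miscalls):
    """Odds ratio after `miscalls` control alleles are wrongly called as minor."""
    return odds_ratio(case_minor, case_major,
                      control_minor + miscalls, control_major - miscalls)


# Hypothetical allele counts (2000 alleles per group); not data from the article.
rare = (30, 1970, 20, 1980)       # low-frequency variant
common = (900, 1100, 800, 1200)   # common variant
miscalls = 10                     # assumed: 0.5% of control alleles miscalled

for name, table in (("rare", rare), ("common", common)):
    before = odds_ratio(*table)
    after = odds_ratio_after_miscalls(*table, miscalls)
    print(f"{name}: OR {before:.2f} -> {after:.2f}")
# prints:
# rare: OR 1.51 -> 1.00
# common: OR 1.23 -> 1.20
```

Under these assumed counts, the same ten miscalled alleles remove the apparent association for the low-frequency variant while barely moving the odds ratio of the common variant.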
Beyond microarray technology that has long been the
mainstay,²⁵ next-generation sequencing (NGS)²⁶,²⁷ has emerged
as the preferred high-throughput genotyping technology for
genetic study.²⁸⁻³² Upstream analysis using NGS is a complicated
process comprising many steps,³³ including DNA fragmentation,
sequencing, base-calling to generate the sequences of the DNA
fragments (raw reads), mapping raw reads to a reference genome
and determining single-nucleotide variants (SNVs). Each step
can introduce errors that may affect the quality of SNVs passed
to downstream analysis. Therefore, quality controls (QCs) for each
step in upstream analysis have been developed to detect, prevent
and generally mitigate errors and analysis biases, with the
cumulative aim of improving the quality of SNVs derived from
NGS data. Notable examples include QC interventions for library
preparation,³⁴,³⁵ base-calling,³⁶ raw reads,³⁷ mapping³⁸ and
SNV-calling.³⁹,⁴⁰
Despite the multiple QC strategies that have been applied in
upstream analysis, the quality in terms of accuracy, repeatability
and reproducibility of SNVs that would be passed to downstream
analysis remains deficient. SNV quality deficits were
evidenced by recent findings that the three popular sequencing
platforms (Roche/454, Illumina/HiSeq and Life Technologies/SOLiD)
had SNV detection biases.⁴¹ Low concordances were
observed between the SNVs called using five popular SNV-calling
pipelines on the same exome-sequencing data.⁴² The need for
additional improvement in SNV quality for genetic studies
suggests an opportunity for further QC interventions to be carried
out subsequent to upstream analysis, but before or in concert with
downstream analysis. To our knowledge, no such post-upstream
¹Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA and ²Office of The Center Director,
Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, MD, USA. Correspondence: Dr H Hong, Division of Bioinformatics and Biostatistics,
National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA.
E-mail: huixiao.hong@fda.hhs.gov
Received 31 March 2014; revised 16 July 2014; accepted 19 September 2014
The Pharmacogenomics Journal (2014), 1 – 12
© 2014 Macmillan Publishers Limited All rights reserved 1470-269X/14
www.nature.com/tpj