A collaboration between l'INRA and l'Institut de
l'abeille (ITSAP) is currently in progress, the
SeqApiPop project, in which 1000 drones
representing 1000 colonies will be paired-end
sequenced at 6X coverage on the Illumina HiSeq
platform. The aim of which is to characterise the
genetic diversity of the domestic black bee, Apis
mellifera mellifera, in France.
Heterozygosity is a major challenge to high-quality
variant calling in next-generation sequence data,
typically requiring high-depths of coverage to
circumvent. This places considerable economic
constraints on population studies of humans and
livestock species, which are diploid in nature and
typically have large genome sizes (> 2 Gb). The
haplodiploid nature of honeybees, and their small
genome size (< 250 Mb), permits large population
studies to be conducted economically by whole-
genome sequencing of haploid drones, although to
date no research group to our knowledge has
capitalised on this opportunity. Here we present an
overview of the bioinformatics framework and
preliminary analyses following its application to 60
drones.
METHODS
One drone was sequenced per colony from a total of
60 colonies representing two distinct populations; one
used in the production of honey (JFM), and one for
the production of royal jelly (OTH). Reads were
mapped to the reference (Amel4.5) using BWA-
MEM
1
, variants called (Fig 1A), and missing
genotypes imputed within-population using BEAGLE
2
(Fig 1B). Population data were then merged ( Fig 1C)
and converted from VCF to Plink
3
format using Plink
v1.9
4
. Subsequent analysis was conducted in R
5
to
detect signatures of selection, and ADMIXTURE
6
to
estimate ancestry.
RESULTS
Alignment metrics indicate a mean depth of coverage
across all data of 7X. On average, reads mapped to
92% of the genome, of which 69% was considered
callable (depth ≥ 3X). The average number of variants
identified per caller, and the percentage retained after
combining with BAYSIC
7
were: GATK
8
952K (87%),
Pileup
9
846K (94%), Platypus
10
492K (98%); resulting
in a final set of 830K SNPs per individual . The
combined data resulted in 4.27M SNPs overall, of
which an average of 7% were imputed per sample.
4M SNPs mapped to LG1-16 and MT , of which 34K
were unique to JFM and 19K unique to OTH.
Prior to running ADMIXTURE the data was pruned to
remove SNPs in high linkage disequilibrium (r
2
> 0.3)
in 50-SNP windows (10-SNP step size) using Plink
v1.9. Resulting in 665K SNPs for ADMIXTURE
analysis, the results of which are presented in Fig 2.
The optimal K, inferred from the cross-validation error,
was observed to be 2 and this correctly partitions the
two populations. However it is interesting to note with
increasing K values the difference in admixture levels
between the two populations. An identity-by-state
(IBS) matrix calculated in Plink using the same
dataset reported a mean IBS of 0.79 in JFM and
0.84 in OTH.
Signatures of selection were detected by estimating
FST using a sliding window approach ( Fig 3).
Windows overlapped by 75%, and significance
thresholds were marked at the 99
th
and 99.9
th
percentiles. A number of regions exceed the
significance thresholds, the most notable of which is
an interval on LG1 ( Fig 3B)which corresponds with a
previously published quantitative trait loci (QTL;
pln2)
11
. The peak of this interval (LG1:18.51-18.67
Mb) contains an odorant receptor gene, suggesting it
to be a strong candidate for honey production. Two
less substantial intervals were detected on LG11 ( Fig
3C) in proximity to the major royal jelly protein gene
complex.
CONCLUSIONS
Sequencing of haploid drones circumvents issues of
heterozygosity, enabling a lower depth in variant
calling. An assessment of this strategy in 60 bees,
following basic tests to measure LD, admixture, and
to detect signatures of selection indicates that it will
be effective for larger studies.
Characterisation of French honey bee populations by
whole-genome sequencing
“sequencing drones circumvents issues of heterozygosity”
Alain Vignal
1
, David Wragg
1
, Benjamin Basso
2
, Jean-Pierre Bidanel
3
, Yves Le Conte
4
1
INRA, GenPhySE, Toulouse;
2
ITSAP, PrADE, Avignon;
3
INRA, GABI, Jouy-en-Josas;
4
INRA, PrADE, Avignon
Correspondance to david.wragg@toulouse.inra.fr or alain.vignal@toulouse.inra.fr
UMR1388 GenPhySE
31326 Castanet Tolosan, France
https://www-lgc.toulouse.inra.fr/intranet
REFERENCES
1. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv13033997 Q-Bio (2013). at <http://arxiv.org/abs/1303.3997>
2. Browning, B. L. & Browning, S. R. A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
3. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
4 Purcell, S & Chang, C. PLINK v1.90ble. at <https://www.cog-genomics.org/plink2>
5. R Development Core Team. R: A Language and Environment for Statistical Computing. 1, 409 (2011).
6. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. (2009). doi:10.1101/gr.094052.109
7. Cantarel, B. L. et al. BAYSIC: a Bayesian method for combining sets of genome variants with improved specificity and sensitivity. BMC Bioinformatics 15, 104 (2014).
8. McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
9. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinforma. Oxf. Engl. 25, 2078–2079 (2009).
10. Rimmer, A. et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat. Genet. advance online publication, (2014).
11. Page, R. E., Fondrk, M. K. & Rueppell, O. Complex pleiotropy characterizes the pollen hoarding syndrome in honey bees (Apis mellifera L.). Behav. Ecol. Sociobiol. 66, 1459–1466 (2012).
Fig 2 | ADMIXTURE plot for K 2 to 5
JFM OTH
K = 2
K = 3
K = 4
[ B ] [ A ] [ C ]
Fig 1 | SeqApiPop framework
Reads are mapped and processed according to best practices ( A). Variants are called using GATK, Platypus and Pileup, and then
combined using BAYSIC which estimates their likelihood of being genuine. SNP sites are identified across all samples. Within-
population, samples are re-genotyped at all sites and missing values imputed ( B). Multiple populations are merged into a single
dataset for downstream analysis ( C).
0.0
0.1
0.2
0.3
0.4
0.5
Chromosome
Fst
1
0.0
0.1
0.2
0.3
0.4
Chromosome
Fst
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Chromosome
Fst
11
A
B
C
Fig 3 | Signatures of selection
Genome-wide FST based analyses using sliding windows with
75% overlap. Significance thresholds indicate 99
th
and 99.9
th
percentiles. Chromosomes 1-16 illustrated following analysis
using 40 Kb windows (A), whilst individual plots for
chromosome 1 (B) and 11 (C) were generated using 10 Kb
windows. Red vertical lines indicate quantitative trait loci
linked to pollen hoarding
11
, whilst blue vertical lines indicate
the boundaries of the major royal jelly protein gene complex.