© 2011 Nature America, Inc. All rights reserved.
NATURE GENETICS ADVANCE ONLINE PUBLICATION
ARTICLES
Genomes are shaped by the interaction of diverse processes and
evolutionary forces: recombination, gene conversion, mutation,
selection and demography, as well as recurrent cycles of polyploidization
and subsequent diploidization, along with hybridization
and the associated processes of admixture and introgression.
Disentangling the effects of these processes on sequence variation
is essential not only for understanding how genetic diversity is generated
and maintained but also for tracking down allelic variants
responsible for phenotypic variation. A. thaliana and its close relatives
have been at the forefront of investigations of these processes
in plants¹,². For example, both the local and global population structures
of A. thaliana, which reflect the species’ migration history
since the Ice Age as well as the surprisingly frequent outcrossing
events between the inbred strains, have been studied in considerable
detail³,⁴. The first genome-wide haplotype map of a plant was
produced for this species⁵, and the information from this endeavor
has already been successfully used for genome-wide association
studies (GWAS)⁶⁻⁹. Despite the rapid progress in linking genotype
to phenotype, a major gap remains in the ability to identify alleles
that are directly responsible for variation in adaptive traits. As in
humans, the complete sequencing of genomes provides an essential
stepping stone toward this goal. Moreover, the recent completion
of a reference genome sequence for the species’ closest relative,
Arabidopsis lyrata, is informing the interpretation of polymorphism
patterns in A. thaliana¹⁰.
Exploratory efforts with a small number of strains suggested early
on that short-read sequencing is an efficient means of describing
whole-genome sequence variation in A. thaliana¹¹,¹², and on the basis
of early successes, a 1001 Genomes Project for the species has been
advocated¹³ (see URLs for project website). Here we present results
from the first major phase of the 1001 Genomes Project, an analysis of
80 strains that were chosen to represent the genetic diversity present
in eight populations across the entire native range of the species. The
study design supports systematic investigation of the effects of geography
and demography on whole-genome sequence variation.
RESULTS
Sequencing of 80 A. thaliana accessions
The native range of A. thaliana is in Eurasia, spanning varied
climates and elevations, from the high mountains of Central Asia to the
European Atlantic Coast, and from North Africa to the Arctic Circle.
To enable the discovery of both global and local effects on sequence
diversity, we focused on six larger geographic regions: the Iberian
Peninsula with North Africa; Southern Italy; Eastern Europe; the
Caucasus; Southern Russia; and Central Asia. In addition, we sampled
two much smaller regions, Swabia, in the southwest of Germany,
and South Tyrol, in the north of Italy (Fig. 1). From each region, we
selected 7–14 naturally inbred strains, or accessions, that we had identified
as genetically diverse on the basis of limited genome-wide genotyping
(Fig. 1a and Supplementary Table 1). From a single individual
Whole-genome sequencing of multiple Arabidopsis thaliana populations
Jun Cao¹,⁸, Korbinian Schneeberger¹,²,⁸, Stephan Ossowski¹,³,⁴,⁸, Torsten Günther⁵,⁸, Sebastian Bender¹, Joffrey Fitz¹, Daniel Koenig¹, Christa Lanz¹, Oliver Stegle⁶, Christoph Lippert⁶, Xi Wang¹, Felix Ott¹, Jonas Müller¹, Carlos Alonso-Blanco⁷, Karsten Borgwardt⁶, Karl J Schmid⁵ & Detlef Weigel¹
The plant Arabidopsis thaliana occurs naturally in many different habitats throughout Eurasia. As a foundation for identifying
genetic variation contributing to adaptation to diverse environments, a 1001 Genomes Project to sequence geographically diverse
A. thaliana strains has been initiated. Here we present the first phase of this project, based on population-scale sequencing
of 80 strains drawn from eight regions throughout the species’ native range. We describe the majority of common small-scale
polymorphisms as well as many larger insertions and deletions in the A. thaliana pan-genome, their effects on gene function,
and the patterns of local and global linkage among these variants. The action of processes other than spontaneous mutation is
identified by comparing the spectrum of mutations that have accumulated since A. thaliana diverged from its closest relative
10 million years ago with the spectrum observed in the laboratory. Recent species-wide selective sweeps are rare, and potentially
deleterious mutations are more common in marginal populations.
¹Max Planck Institute for Developmental Biology, Tübingen, Germany.
²Max Planck Institute of Plant Breeding Research, Cologne, Germany.
³Center for Genomic Regulation, Barcelona, Spain.
⁴Universitat Pompeu Fabra, Barcelona, Spain.
⁵Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Stuttgart, Germany.
⁶Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany.
⁷Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas, Madrid, Spain.
⁸These authors contributed equally to this work.
Correspondence should be addressed to D.W. (weigel@weigelworld.org).
Received 8 March; accepted 26 July; published online 28 August 2011; doi:10.1038/ng.911