Feature Review
Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes
Hyungtaek Jung,1,* Christopher Winefield,2 Aureliano Bombarely,3,4 Peter Prentis,5 and Peter Waterhouse1,6,*
The commercial release of third-generation sequencing technologies (TGSTs),
giving long and ultra-long sequencing reads, has stimulated the development
of new tools for assembling highly contiguous genome sequences with
unprecedented accuracy across complex repeat regions. We survey here a wide range
of emerging sequencing platforms and analytical tools for de novo assembly,
provide background information for each of their steps, and discuss the
spectrum of available options. Our decision tree, used in combination with
the specific needs and resources of a project, recommends workflows for
generating a high-quality genome assembly.
Challenges and Progress with Plant Genomics
A genome assembly is the sequence produced after all of the chromosomes of a target
species have been fragmented into a large number of short or long DNA sequences, sequenced,
and computationally put back together to create a representation of the original intact
chromosome sequences. De novo genome assembly assumes no prior knowledge of the
source DNA sequence length, layout, or composition. The usual aim of a genome assembly is
to build a highly accurate, contiguous (i.e., an uninterrupted stretch of overlapping DNA)
consensus sequence representing a haploid-phase version of the genome (one for each parental
haplotype) of the target species. The costs of acquiring sufficient sequence data for such an
assembly have now dropped to a level that most laboratories can afford, which has led to the recent
explosion in the number of plant species being sequenced. Four questions must be considered when
embarking on a new genome assembly project: (i) how big is the genome?; (ii) is it a diploid,
polyploid, and/or highly heterozygous hybrid species?; (iii) how much repetitive sequence is likely
to be present in the genome?; and (iv) what is the best experimental and computational design to
employ?
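Question (i) feeds directly into how much sequence data a project must acquire. As a minimal sketch, assuming the standard relationship that average sequencing depth equals total sequenced bases divided by haploid genome size, the arithmetic can be written as follows. The function names, the 500 Mb genome size, and the 50x target depth are illustrative assumptions and are not values recommended in this review.

```python
def expected_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Average depth: total sequenced bases divided by haploid genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


def required_bases(genome_size_bp: float, target_coverage: float) -> float:
    """Total bases that must be sequenced to reach a target average depth."""
    if genome_size_bp <= 0 or target_coverage <= 0:
        raise ValueError("genome size and target coverage must be positive")
    return genome_size_bp * target_coverage


if __name__ == "__main__":
    # Hypothetical project: a 500 Mb genome at an assumed 50x target depth.
    genome_size = 500_000_000
    needed = required_bases(genome_size, 50)
    print(f"{needed / 1e9:.1f} Gb of sequence needed")  # prints "25.0 Gb of sequence needed"
```

This is an average only: repeat content, heterozygosity, and ploidy (questions ii and iii) determine whether a given depth is actually sufficient for a particular genome.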
Most large plant genomes have high levels of repeated and duplicated sequences owing to
whole-genome, chromosomal, subchromosomal, or tandem duplications (e.g., transposable
element activity) [1,2]. With genome assemblies based on short-read (75–700 bp) data, the
repeats and duplications are often not well resolved, leading to the bioinformatic formation of
chimeric sequences (see Glossary) and fragmented contigs. Third-generation sequencing
platforms (Pacific Biosciences, PacBio; and Oxford Nanopore Technologies, ONT), which generate
individual read lengths from 8 kb to 40 kb (maximum >150 kb for PacBio and >2 Mb for ONT) [3],
give much better resolution and contiguity. Nevertheless, some regions of a genome, such as
the telomeric and centromeric regions of chromosomes, are often poorly resolved because
they can contain megabases of repeated sequences. Current bioinformatic software does not
cope well with these difficult regions, especially in the complex and polyploid genomes of many
Highlights
Tumbling sequencing costs, improvements in bioinformatic pipelines, and
increased access to high-performance computing capabilities have resulted in
a perfect storm where nonspecialist genomics research groups are able to
access, deploy, and generate de novo genome sequences in nonmodel plant
systems.
However, generating a high-quality assembly for many plant species still
presents significant challenges owing to genome size, complexity, and
experimental and computational design.
Selecting the most appropriate sequencing and software platforms for a
new genome project can be confusing and daunting because of the wide
spectrum of available options and the performance quality of specific tools in
different contexts.
1 Centre for Tropical Crops and Biocommodities, Queensland University of Technology, Brisbane, QLD 4001, Australia
2 Department of Wine, Food, and Molecular Biosciences, Lincoln University, 7647 Christchurch, New Zealand
3 Department of Bioscience, University of Milan, Milan 20133, Italy
4 School of Plants and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
5 School of Earth, Environmental, and Biological Sciences, Queensland University of Technology, Brisbane, QLD, 4001, Australia
6 School of Biological Sciences, University of Sydney, Sydney, NSW 2006, Australia
Trends in Plant Science, Month 2019, Vol. xx, No. xx https://doi.org/10.1016/j.tplants.2019.05.003
© 2019 Elsevier Ltd. All rights reserved.