ISSN 1022-7954, Russian Journal of Genetics, 2017, Vol. 53, No. 6, pp. 631–639. © Pleiades Publishing, Inc., 2017.
Original Russian Text © K.S. Zadesenets, N.I. Ershov, N.B. Rubtsov, 2017, published in Genetika, 2017, Vol. 53, No. 6, pp. 641–650.
Whole-Genome Sequencing of Eukaryotes:
From Sequencing of DNA Fragments to a Genome Assembly
K. S. Zadesenetsᵃ,*, N. I. Ershovᵃ, and N. B. Rubtsovᵃ,ᵇ
ᵃ Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090 Russia
ᵇ Novosibirsk State University, Novosibirsk, 630090 Russia
*e-mail: kira_z@bionet.nsc.ru
Received July 25, 2016; in final form, September 6, 2016
Abstract: Rapid advances in second- and even third-generation sequencing technologies have made
whole-genome sequencing a routine procedure. However, the methods for assembling the obtained sequences,
and the results of assembly, require special consideration. Modern assemblers are based on heuristic
algorithms, which produce fragmented genome assemblies composed of scaffolds and contigs of different
lengths, whose order along the chromosome and assignment to a particular chromosome often remain unknown.
The resulting genome sequence can therefore be considered only a draft assembly. The quality
and reliability of a draft assembly can be substantially improved by targeted sequencing of genome elements
of different size, e.g., chromosomes, chromosomal regions, and DNA fragments cloned in different vectors,
as well as by using a reference genome, optical mapping, and Hi-C technology. In addition to simplifying
the assembly of the genome draft, this approach allows more accurate identification of numerical and
structural chromosomal variations and abnormalities in the genomes of the studied species. In this review,
we discuss the key technologies for genome sequencing and de novo assembly, as well as different approaches
to improving the quality of existing draft genome sequences.
Keywords: read, contig, scaffold, de Bruijn graph, chromosome mapping, methods, DNA
DOI: 10.1134/S102279541705012X
DNA SEQUENCING AND ITS IMPORTANCE
FOR MODERN BIOLOGY
At present, DNA sequencing has not only become
a key technology in many areas of modern biology but
has also driven the emergence and further development
of new research directions. The history of sequencing
started in the 1950s, when methods for determining
the amino acid sequences of polypeptide chains
were developed, and the deciphering of the genetic
code allowed the sequence of the transcribed
nucleic acid to be partially inferred. In the late 1960s,
a method of RNA sequencing was developed [1], which
enabled W. Fiers et al. first to determine the sequence
of the gene coding for the bacteriophage MS2 coat
protein [2] and then that of its entire RNA genome [3].
Around the same time, methods of direct DNA
sequencing were developed, including the plus–minus
method [4], the chain termination (Sanger) method [5],
and the chemical degradation method [6].
Over the following decades, the Sanger sequencing
technique became fully automated. Gel electrophoresis
and radioactively labeled nucleotides were replaced
by capillary electrophoresis and nucleotides conjugated
to fluorochromes [7]. The read length obtained in a
single reaction was also increased to 500–1000 bp.
The size limitation on a sequenced DNA fragment
was overcome by sequencing overlapping fragments.
A logical development of this approach was
shotgun sequencing, which is based on random
physical or chemical fragmentation of the DNA template,
cloning of the resulting fragments (~2–3 kb), and their
subsequent sequencing. Because fragmentation is random,
the resulting fragments overlap each other,
so that, once the extended DNA region under study
has been covered several times over, it can be
assembled from the reads. This approach was
successfully used more than 20 years ago to sequence
the first bacterial genome: the genome of Haemophilus
influenzae was assembled from ~24 × 10³ reads with
lengths of ~460 bp [8]. For prokaryotic genomes,
which contain few repeats, the complete genome
sequence can be assembled from sequenced fragments
at a relatively low coverage (7–10×).
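The relation between read number, read length, and coverage described above can be illustrated with a minimal Python sketch. It is not taken from the cited work. It assumes a simple Poisson (Lander-Waterman type) model in which reads start at uniformly random positions, so that the expected fraction of bases left uncovered at mean coverage c is about e^(-c). The function names are hypothetical, and the H. influenzae genome size of ~1.83 Mb used in the usage example is an outside assumption, not a value given in the text.

```python
import math
import random


def mean_coverage(n_reads, read_len, genome_size):
    """Mean coverage c = N * L / G (total sequenced bases per genome base)."""
    return n_reads * read_len / genome_size


def expected_uncovered_fraction(c):
    """Poisson approximation: probability that a given base is hit by no read."""
    return math.exp(-c)


def simulate_covered_fraction(n_reads, read_len, genome_size, seed=0):
    """Place reads at random starts on a circular genome; return the covered fraction.

    Assumes read_len <= genome_size. A difference array records where read
    depth goes up and down, so the depth profile is built in one pass.
    """
    rng = random.Random(seed)
    diff = [0] * (genome_size + 1)
    for _ in range(n_reads):
        start = rng.randrange(genome_size)
        end = start + read_len
        diff[start] += 1
        if end <= genome_size:
            diff[end] -= 1
        else:
            # The read wraps around the origin of the circular genome.
            diff[genome_size] -= 1
            diff[0] += 1
            diff[end - genome_size] -= 1
    covered = 0
    depth = 0
    for i in range(genome_size):
        depth += diff[i]
        if depth > 0:
            covered += 1
    return covered / genome_size


if __name__ == "__main__":
    # ~24e3 reads of ~460 bp are from the text; 1.83 Mb is an assumed genome size.
    c = mean_coverage(24_000, 460, 1_830_000)
    print(f"mean coverage: {c:.2f}x")
    print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.4f}")
    # Toy genome of 100 kb sequenced at 7x with 500-bp reads.
    print(simulate_covered_fraction(1_400, 500, 100_000, seed=1))
```

Under these assumptions, the read numbers quoted for H. influenzae correspond to roughly sixfold raw coverage, and at 7–10× only a very small fraction of bases is expected to remain uncovered. The model ignores repeats, sequencing errors, and the minimum overlap needed to join two reads, so it gives an optimistic bound rather than a prediction of assembly contiguity.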
The error probability in Sanger sequencing varies
from 10⁻⁵ to 10⁻⁴ at read lengths of about 1000 bp.
Although Sanger sequencing is still considered
the gold standard and is widely used in various
studies, it has considerable drawbacks, including
high per-base costs and low throughput [9]. Next-
REVIEWS
AND THEORETICAL ARTICLES