ISSN 1022-7954, Russian Journal of Genetics, 2017, Vol. 53, No. 6, pp. 631–639. © Pleiades Publishing, Inc., 2017.
Original Russian Text © K.S. Zadesenets, N.I. Ershov, N.B. Rubtsov, 2017, published in Genetika, 2017, Vol. 53, No. 6, pp. 641–650.

Whole-Genome Sequencing of Eukaryotes: From Sequencing of DNA Fragments to a Genome Assembly

K. S. Zadesenets a, *, N. I. Ershov a, and N. B. Rubtsov a, b
a Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090 Russia
b Novosibirsk State University, Novosibirsk, 630090 Russia
*e-mail: kira_z@bionet.nsc.ru
Received July 25, 2016; in final form, September 6, 2016

Abstract: Rapid advances in second- and even third-generation sequencing technologies have made whole-genome sequencing a routine procedure. However, the methods for assembling the obtained sequences, and the assemblies they produce, require special consideration. Modern assemblers rely on heuristic algorithms, which yield fragmented genome assemblies composed of scaffolds and contigs of different lengths; their order along a chromosome, and even the chromosome to which they belong, often remain unknown. The resulting genome sequence can therefore be considered only a draft assembly. The quality and reliability of a draft assembly can be substantially improved by targeted sequencing of genome elements of different sizes (e.g., chromosomes, chromosomal regions, and DNA fragments cloned in different vectors), as well as by using a reference genome, optical mapping, and Hi-C technology. In addition to simplifying the assembly of the genome draft, this approach makes it possible to identify more accurately the numerical and structural chromosomal variations and abnormalities in the genomes of the studied species. In this review, we discuss the key technologies for genome sequencing and de novo assembly, as well as different approaches to improving the quality of existing draft genome sequences.
Keywords: read, contig, scaffold, de Bruijn graph, chromosome mapping, methods, DNA
DOI: 10.1134/S102279541705012X

DNA SEQUENCING AND ITS IMPORTANCE FOR MODERN BIOLOGY

At present, DNA sequencing has not only become a key technology in many areas of modern biology but has also driven the discovery and further development of new research directions. The history of sequencing started in the 1950s, when methods for determining the amino acid sequences of polypeptide chains were developed, and the deciphering of the genetic code made it possible to partially infer the sequence of the transcribed nucleic acid. In the late 1960s, a method of RNA sequencing was developed [1], which enabled W. Fiers et al. to determine first the sequence of the gene coding for the bacteriophage MS2 coat protein [2] and then the sequence of its whole genome [3]. Around the same time, methods of direct DNA sequencing were developed, including the plus–minus method [4], the chain termination method [5], and the chemical degradation method [6]. Over the following decades, the Sanger sequencing technique became fully automated. Gel electrophoresis and radioactively labeled nucleotides were replaced by capillary electrophoresis and nucleotides conjugated to fluorochromes [7]. It also became possible to increase the read length of a DNA fragment in a single reaction to 500–1000 bp.

The size limitation on the sequenced DNA fragment was overcome by sequencing overlapping fragments. A logical development of this approach was the shotgun sequencing method, which is based on random physical or chemical fragmentation of the DNA template, cloning of the resulting fragments (~2–3 kb), and their subsequent sequencing. Because the fragmentation is random, the resulting fragments overlap one another, so that an extended DNA fragment covered by reads several times over can be assembled from them. This approach was successfully used more than 20 years ago to sequence the first bacterial genome.
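The shotgun idea described above can be illustrated with a minimal sketch: reads are sampled at random positions, and overlapping reads are then merged back into a longer sequence. Everything in it is an assumption made for illustration, including the function names (`mean_coverage`, `shotgun_reads`, `overlap`, `greedy_assemble`), the 14-bp toy sequence, and the minimum overlap length. The greedy pairwise merging shown here is a deliberately simple stand-in and is not the procedure used by any assembler cited in this review.

```python
import random


def mean_coverage(n_reads, read_len, genome_len):
    """Average depth of coverage: total sequenced bases / target length."""
    return n_reads * read_len / genome_len


def shotgun_reads(genome, n_reads, read_len, rng):
    """Toy fragmentation: reads start at uniformly random positions."""
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]  # inclusive bounds
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Longest suffix of `a` equal to a prefix of `b`; 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len):
    """Merge the best-overlapping pair until no overlap >= min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Discard sequences that are fully contained in another sequence.
        contigs = [c for c in contigs
                   if not any(c != d and c in d for d in contigs)]
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    genome = "ATGCGTACGTTAGC"                    # hypothetical 14-bp target
    reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]  # hand-picked overlapping reads
    print(greedy_assemble(reads, min_len=3))

    rng = random.Random(0)                       # fixed seed for repeatability
    sampled = shotgun_reads(genome, n_reads=20, read_len=6, rng=rng)
    print(mean_coverage(len(sampled), 6, len(genome)))
```

Repeats longer than the read length make overlaps ambiguous, so a sketch of this kind is meaningful only for short, repeat-free toy sequences.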
The genome of Haemophilus influenzae was assembled from ~24 × 10³ reads with lengths of ~460 bp [8]. For prokaryotic genomes, which contain few repeats, the complete genome sequence can be assembled from sequenced fragments at a relatively low coverage (7–10×).

The error probability in Sanger sequencing varies from 10⁻⁵ to 10⁻⁴ at a read length of about 1000 bp. Although Sanger sequencing is still considered the gold standard and is widely used in various studies, it has considerable drawbacks, including high per-base cost and low throughput [9].

REVIEWS AND THEORETICAL ARTICLES

Next-