Scientific Reports | 7: 3935 | DOI:10.1038/s41598-017-03996-z
www.nature.com/scientificreports

De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms

Francesca Giordano 1, Louise Aigrain 1, Michael A Quail 1, Paul Coupland 2, James K Bonfield 1, Robert M Davies 1, German Tischler 3, David K Jackson 1, Thomas M Keane 1, Jing Li 4, Jia-Xing Yue 4, Gianni Liti 4, Richard Durbin 1 & Zemin Ning 1

Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.

The advent of next generation sequencing technologies (NGS) has marked the start of a new era in genomics research.
Compared to the previous Sanger technology 1, NGS has significantly lowered the cost of sequencing using massively parallel sequencing methods 2, 3. In a typical NGS run, DNA molecules are sheared into small fragments and then clonally amplified before being sequenced. After DNA amplification, multiple fragments of the sequences obtained may cover the same genome region, so that computational algorithms can be used to concatenate and assemble such reads like a jigsaw puzzle and generate a consensus to correct for the occasional sequencing errors. The typical length of the DNA fragments sequenced is between 50 and 400 bases long 2, and as a result, the assembly obtained from such short reads is fragmented in contigs much smaller than the actual chromosome sizes. In particular, short reads are not able to solve complex genome features like repeated regions (repeats) longer than the fragment length or copy number variations, with the typical outcome that (almost-) identical repeats are collapsed into a single element in the assembly.

To overcome the high fragmentation of NGS-based assemblies and to help resolve long repeats, long-read sequencing technologies have been developed and recently adopted by the genomics community. The main characteristic of these new platforms is to work with long DNA molecules and provide reads with lengths up to hundreds of kilobases (kb). Reads of such length can be exploited in various ways. Particularly in the genome assembly field they can be used for de novo assembly with long-read data only, or for scaffolding of NGS-based assemblies by bridging gaps between contigs or spanning long repeats thus resolving them. A major drawback of long-read technologies is the higher rate of sequencing errors (5–20%) compared to NGS data (<1%) 2.
Such an error profile could negatively affect the assembly accuracy, but because the errors are mostly randomly distributed the majority of long-read assemblers adopt the strategy of correcting base errors algorithmically before attempting to assemble the reads.

1 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. 2 Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge, CB2 0RE, UK. 3 Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstraße 108, 01037, Dresden, Germany. 4 Université Côte d’Azur, CNRS, INSERM, IRCAN, Nice, France. Correspondence and requests for materials should be addressed to F.G. (email: francesca.giordano@sanger.ac.uk)

Received: 17 January 2017 Accepted: 8 May 2017 Published: xx xx xxxx
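
The point made in the introduction, that randomly distributed errors can be corrected algorithmically when several reads cover the same region, can be illustrated with a toy example. The sketch below is not the method of any assembler discussed in this paper. It assumes that reads are already aligned to the same coordinates, that errors are substitutions only (real long-read errors also include insertions and deletions), and that a 15% per-base error rate and a depth of 31 reads are representative values. All function names are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"


def add_substitution_errors(seq, error_rate, rng):
    """Replace each base by a different base with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def majority_consensus(aligned_reads):
    """Per-column majority vote over equal-length, already aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def mismatch_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumed toy setup: a random 2,000-base "genome" covered by 31 noisy reads.
reference = "".join(rng.choice(BASES) for _ in range(2000))
reads = [add_substitution_errors(reference, 0.15, rng) for _ in range(31)]
consensus = majority_consensus(reads)

print("single read mismatch fraction:", mismatch_fraction(reads[0], reference))
print("consensus mismatch fraction:  ", mismatch_fraction(consensus, reference))
```

Because each error falls at an independent position, the correct base remains the most frequent one in almost every column, so the consensus is far closer to the reference than any single read. The same vote would not remove systematic errors that recur at the same position in most reads.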