Research Article
RECORD: Reference-Assisted Genome Assembly for
Closely Related Genomes
Krisztian Buza, Bartek Wilczynski, and Norbert Dojer
Faculty of Mathematics, Informatics and Mechanics (MIM), University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
Correspondence should be addressed to Krisztian Buza; buza@biointelligence.hu
Received 18 March 2015; Revised 27 May 2015; Accepted 31 May 2015
Academic Editor: Chun-Yuan Lin
Copyright © 2015 Krisztian Buza et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single
experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in
the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and
the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely
related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach
applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a
modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference
sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its
implementation, called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available
on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly
software.
1. Background
The emergence of population genomic projects leads to an
ever-growing need for software and methods that facilitate
studying closely related organisms with next-generation
sequencing technologies. This includes determining the
genomic sequences of individuals in the presence of the more
generic reference genome of the species. This task is known
as reference-assisted genome assembly, and many ongoing
research projects depend on an accurate solution to this
problem.
In recent years, next-generation sequencing technologies
have made it possible to simultaneously sequence
millions of short DNA fragments in a DNA library prepared
from almost any biochemical experiment [1]. Great improvement
in the quality and number of short reads obtained
from a single experiment has allowed the development of many
more biochemical assays [2] such as MNase-seq [3], DNase-seq
[4], or ChIA-PET [5] in addition to the more standard
ChIP-seq [6] or RNA-seq [7]. Similarly, next-generation
sequencing techniques may be applied to metagenomic samples,
returning short reads originating from multiple genomes,
including some potentially unknown species.
Importantly, many of these techniques require prior
knowledge of the reference genome of the species for which
the experiment was performed. This genome sequence is
used to map the reads and obtain the final readout of the
experiment as the read counts per base pair. Such procedures
are guaranteed to work very well only under the assumption
that we know the exact sequence of the genome under study.
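As a minimal illustration of what a readout expressed as "read counts per base pair" means, the sketch below counts how many mapped reads cover each position of a reference. The function name `per_base_read_counts` and the representation of an alignment as a (0-based start, length) pair are assumptions made for this example only; they are not part of RECORD or of any particular read mapper.

```python
from typing import Iterable, List, Tuple


def per_base_read_counts(
    genome_length: int, alignments: Iterable[Tuple[int, int]]
) -> List[int]:
    """Return, for every reference position, the number of reads covering it.

    Each alignment is a hypothetical (start, length) pair with a 0-based
    start on the reference; reads running past the end are clipped.
    """
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (genome_length + 1)
    for start, length in alignments:
        if length <= 0 or start < 0 or start >= genome_length:
            continue  # ignore reads that do not overlap the reference
        end = min(start + length, genome_length)
        diff[start] += 1
        diff[end] -= 1

    # A running sum over the difference array gives per-base coverage.
    counts: List[int] = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        counts.append(running)
    return counts


if __name__ == "__main__":
    # Three reads mapped to a reference of length 10.
    print(per_base_read_counts(10, [(0, 4), (2, 4), (8, 5)]))
```

If the sequenced genome differs from the reference, reads overlapping the differing region may map poorly or not at all, which would distort counts of this kind around that region.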
There are, however, many biologically relevant cases when
this assumption cannot be satisfied. For example, in quickly
growing cell populations such as cancer cell lines or microbial
colonies, even rare mutations can become fixed in the
population very quickly. This leads to situations where sampled
sequences can significantly differ from the original reference
genome. Similarly, many lab experiments involve genetically
modified cells or organisms. While these modifications are
usually controlled as much as possible, the researchers
frequently do not know the exact landing site of the introduced
Hindawi Publishing Corporation
International Journal of Genomics
Volume 2015, Article ID 563482, 10 pages
http://dx.doi.org/10.1155/2015/563482