RESOLVING PROBLEMATIC TOPOLOGIES IN K-MER GRAPHS: A PROPOSAL OF A NEW METHOD FOR DNA FRAGMENT ASSEMBLY

Couto A D¹, Cerqueira F R¹,²
¹ Department of Informatics and ² Research Group for Bioinformatics, Federal University of Viçosa, Brazil.

Introduction

DNA sequencing has made possible many important discoveries in biology, medicine, and many other areas. It has also historically posed interesting computational challenges, such as the DNA fragment assembly problem. Sanger sequencing is more expensive and slower than next-generation sequencing (NGS). On the other hand, NGS produces many more reads per run, which makes assembly a much more complex process. In this work, we explore the de novo DNA fragment assembly problem modeled as a k-mer graph, and provide the theoretical basis for a new method of finding paths in such graphs in order to reduce the complexity of the problem.

We aim to investigate in depth how to choose the best maximum matching so as to obtain a good approximation of the target consensus in a shorter running time. In summary, at the end of this project, we intend to deliver an efficient and accurate general-purpose DNA fragment assembler.

Methodology

A common way to represent relationships between the fragments being assembled is a k-mer graph (Figure 1). In this case, each vertex represents a read subsequence of length k. Successive subsequences in a read are represented by connected vertices that form a path, and each edge on such a path corresponds to an overlap of length k-1 between consecutive subsequences. This technique avoids costly pairwise alignments between the reads. Because of common difficulties such as repeats and base-call errors, k-mer graphs normally take on undesired forms other than single paths.
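To make the k-mer graph construction concrete, the following is a minimal sketch rather than the assembler's actual implementation. The function names `kmers` and `build_kmer_graph` are hypothetical, and the graph is assumed to be stored as a plain dictionary that maps each k-mer to the set of k-mers following it in some read.

```python
def kmers(read, k):
    """Return the successive length-k subsequences of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_kmer_graph(reads, k):
    """Build a k-mer graph as {k-mer: set of successor k-mers}.

    Hypothetical helper for illustration: each distinct k-mer is one vertex,
    and an edge a -> b is added whenever b directly follows a in a read.
    """
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for kmer in ks:
            graph.setdefault(kmer, set())  # every k-mer becomes a vertex
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)  # a and b overlap in k-1 characters
    return graph


# A single error-free read yields a single path of k-mers.
graph = build_kmer_graph(["cgtaccgcc"], k=4)
for kmer, successors in graph.items():
    print(kmer, "->", sorted(successors))
```

With several reads, shared k-mers collapse into the same vertex, which is how overlaps are detected without pairwise alignment; it is also why a base-call error or a repeat makes the graph branch instead of remaining a single path.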
According to Miller et al. (2010), four problematic topologies may occur in k-mer graphs: bubbles are formed by base-call errors in the middle of reads (Figure 2); spurs are also caused by base-call errors, but at one end of a read (Figure 3); frayed ropes are paths that converge and then diverge because of repeated regions in the molecule (Figure 4); finally, cycles are also induced by repeats. To deal with these topologies, assemblers such as ALLPATHS (Butler et al., 2008), Velvet (Zerbino and Birney, 2008), and Euler (Chaisson et al., 2009) rely on intensive pre- and post-processing of the data and of the resulting graph, which incurs a high computational cost.

In this work, we propose a new method to cope with problematic topologies using a single procedure: finding a maximum matching in a bipartite version of the k-mer graph. We show that a matching in the bipartite graph is equivalent to a set of paths in the original graph. These paths indicate an order of the given fragments from which a consensus sequence can be easily obtained. Maximum matchings can be found in polynomial time. Since a given graph may have more than one maximum matching, we are working on the modifications needed to obtain biologically meaningful matchings. This technique was previously proposed by Cerqueira and Meidanis (2001) for overlap graphs. However, those authors did not explore k-mer graphs or how the choice of a particular maximum matching affects the quality of the assembly.

Figure 1: k-mer graph example, k=4. Real cases have higher values of k. (Labels recovered from the figure: sequence cgtaccgcc; vertices cgta, gtac, tacc, accg, ccgc.)

Figure 3: Example of a spur. k-mer graph G (k=3), its double (bipartite) version G', and the corresponding paths P.

Figure 2: Example of a bubble. k-mer graph G (k=3), its double (bipartite) version G', and the corresponding paths P.
Solid lines in G' represent the matching that leads to the paths P in G.

Figure 4: Example of a frayed rope. k-mer graph G (k=3), its double (bipartite) version G', and the corresponding paths P.

References

Butler et al., 2008. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research. Vol. 18, pp. 810-820.
Cerqueira, F. R. and Meidanis, J., 2001. Algorithms for large scale DNA sequencing. In SEMISH 2001, proceedings of the Brazilian Computer Society Congress.
Chaisson, M. J. et al., 2009. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Research. Vol. 19, pp. 336-346.
Miller, J. R. et al., 2010. Assembly algorithms for next-generation sequencing data. Genomics. Vol. 95 (6), pp. 315-327.
Zerbino, D. R. and Birney, E., 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. Vol. 18, pp. 821-829.

Acknowledgements

This work is supported by CAPES.

Copyright protected. F1000 Posters.
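As an illustration of the matching-to-paths correspondence described in the Methodology, the following is a minimal sketch and not the authors' implementation. It assumes the graph is a dictionary of successor sets, uses the standard augmenting-path (Kuhn) algorithm as a stand-in for whichever polynomial matching algorithm is actually used, and only handles acyclic graphs. The toy graph and the names `maximum_matching` and `paths_from_matching` are hypothetical.

```python
def maximum_matching(graph):
    """Maximum matching in the bipartite double G' of a directed graph G.

    The two copies x_u and y_v of each vertex are implicit: an edge u -> v
    of G stands for the edge (x_u, y_v) of G'. Returns succ, where
    succ[u] == v means that (x_u, y_v) is in the matching.
    Uses Kuhn's augmenting-path algorithm (an assumption of this sketch).
    """
    pred = {}  # pred[v] == u: y_v is currently matched with x_u

    def try_augment(u, seen):
        for v in sorted(graph.get(u, ())):
            if v in seen:
                continue
            seen.add(v)
            # y_v is free, or its current partner can be matched elsewhere
            if v not in pred or try_augment(pred[v], seen):
                pred[v] = u
                return True
        return False

    for u in sorted(graph):
        try_augment(u, set())
    return {u: v for v, u in pred.items()}


def paths_from_matching(graph, succ):
    """Chain matched edges into vertex-disjoint paths of G (acyclic case only)."""
    vertices = set(graph) | {v for vs in graph.values() for v in vs}
    has_pred = set(succ.values())
    paths = []
    for start in sorted(vertices - has_pred):  # vertices with no matched predecessor
        path = [start]
        while path[-1] in succ:
            path.append(succ[path[-1]])
        paths.append(path)
    return paths


# Hypothetical spur-like graph: vertex 4 is a dead end branching off vertex 2.
toy = {1: {2}, 2: {3, 4}, 3: {5}, 4: set(), 5: set()}
succ = maximum_matching(toy)
paths = paths_from_matching(toy, succ)
print(paths)  # [[1, 2, 3, 5], [4]]
```

Each vertex has at most one matched outgoing edge and at most one matched incoming edge, which is why the matched edges chain into disjoint paths. In this toy graph, matching vertex 2 with vertex 4 instead of vertex 3 gives another matching with the same number of edges, which illustrates why the choice among maximum matchings matters for the resulting assembly.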