Scientific Reports | 6:27722 | DOI: 10.1038/srep27722
www.nature.com/scientificreports

Genomic leftovers: identifying novel microsatellites, over-represented motifs and functional elements in the human genome

Natalie C. Fonville*, Karthik Raja Velmurugan*, Hongseok Tae, Zalman Vaksman, Lauren J. McIver & Harold R. Garner

The human genome is 99% complete. This study contributes to filling the 1% gap by enriching for previously unknown repeat regions called microsatellites (MSTs). We devised a Global MST Enrichment (GME) kit and used it to enrich and next-generation sequence 2 colorectal cell lines and 16 normal human samples, illustrating its utility in identifying contigs assembled from reads that do not map to the genome reference. The analysis of these samples yielded 790 novel extra-referential concordant contigs, that is, contigs observed in more than one sample. We searched for evidence of functional elements in the concordant contigs in two ways: (1) BLAST-ing each contig against normal RNA-Seq samples, and (2) checking for predicted functional elements using GlimmerHMM. Of the 790 concordant contigs, 37 had an exact match to at least one RNA-Seq read; 15 aligned to more than 100 RNA-Seq reads. Of the 249 concordant contigs predicted by GlimmerHMM to have functional elements, 6 had at least one exact RNA-Seq match. BLAST-ing these novel contigs against all publicly available sequences confirmed that they were found in human and chimpanzee BAC and fosmid clones sequenced as part of the original Human Genome Project. These extra-referential contigs predominantly contained pentameric repeats, especially two motifs: AATGG and GTGGA.

In April of 2003 the Human Genome Project was declared complete, and from it we gained a framework to build the reference genome upon which the majority of analyses are anchored. The scope of the Human Genome Project was focused on the 94% of the genome that is euchromatin¹, now sequenced to 99% completion².
Attempts are being made to complete the remaining 1% of the incomplete “complete” human reference³; however, we and others have hypothesized that some genomic sequence regions (which may contain functional elements or genes) may be missing from the human reference because they are embedded in refractory repetitive DNA sequence, e.g. microsatellites (MSTs)⁴. MST sequences, regions of repeated 1- to 6-mer DNA motifs, are abundant throughout the genome and are a source of significant genomic variation⁵. However, to date, analysis of microsatellite-containing loci has been limited because standard exome enrichment and whole genome sequencing use software to mask out repeats⁶, focus on capturing non-repetitive DNA, or are designed to capture only a small subset of the known MST loci⁷. In this paper we present a novel target enrichment strategy specifically designed to enrich for all microsatellite loci by using the repeat motif, rather than the flanking sequence, as bait, and we have paired this technique with our recently developed method for analysis of unmapped reads⁸. Our analysis has revealed that: 1) assembly of contigs from unmapped genome sequences and from high-depth sequences produced by this novel target enrichment system, which specifically selects for repetitive elements, enables the quantification and characterization of these regions; 2) concordant contigs, those that appear in multiple samples, contain new structural elements (potential genes, pseudogenes, etc.), a subset of which have high similarity to expressed mRNAs; 3) these extra-referential genome regions are dominated by 5-mer repeats, in particular the centromeric repeats AATGG and GTGGA. This platform technology has the potential to extend “reference genomes” and identify new functional elements.

Via Bioinformatics and Clinical Genetics Network, Edward Via College of Osteopathic Medicine, 2265 Kraft Dr, Blacksburg, VA 24060, USA. *These authors contributed equally to this work.
Correspondence and requests for materials should be addressed to H.R.G. (email: skipgarner@gmail.com)

Received: 15 July 2015; Accepted: 23 May 2016; Published: 09 June 2016
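
To make the working definition of a microsatellite used in the introduction concrete (a tandem run of a repeated 1- to 6-mer DNA motif), the following is a minimal Python sketch of a motif-based scan. It is an illustration only and is not the detection software used in this study. The function name `find_microsatellites`, the `min_copies` threshold of 3, and the toy sequence in the usage note are assumptions introduced here.

```python
import re


def find_microsatellites(seq, min_copies=3, max_motif_len=6):
    """Return sorted (start, end, motif, copies) tuples for tandem repeats.

    Hypothetical helper: `min_copies` is an illustrative threshold,
    not a value taken from the paper.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif_len + 1):
        # A k-mer followed by at least (min_copies - 1) further exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves built from a shorter unit
            # (e.g. "ATAT" is already reported as the 2-mer "AT").
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            copies = (m.end() - m.start()) // k
            hits.append((m.start(), m.end(), motif, copies))
    return sorted(hits)
```

On a toy sequence such as `"CCT" + "AATGG" * 4 + "TTC" + "GTGGA" * 3 + "AC"`, the scan reports one AATGG run with 4 copies and one GTGGA run with 3 copies. Because the scan is leftmost-first, a run may be reported under a rotation of its motif (for example GAATG rather than AATGG) when the flanking base happens to extend it; a fuller implementation would normalise rotations and reverse complements before counting motifs.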