METHODS

STITCH: Algorithm to Splice, Trim, Identify, Track, and Capture the Uniqueness of 16S rRNAs Sequence Pairs Using Public or In-house Database

Dianhui Zhu & Parag A. Vaishampayan & Kasthuri Venkateswaran & George E. Fox

Received: 27 September 2010 / Accepted: 13 November 2010 / Published online: 27 November 2010
© Springer Science+Business Media, LLC (outside the USA) 2010

Abstract

A comparison of variable regions within the 16S rRNA gene is widely used to characterize relationships between bacteria and to identify phylogenetic affiliation of unknown bacteria. In environmental studies, polymerase chain reaction amplification of 16S rRNA followed by cloning and sequencing of numerous individual clones is an extensively used molecular method for elucidating microbial diversity. The sequencing process typically utilizes a forward and reverse primer pair to produce two partial reads (~700 to 800 base pairs each) that overlap and in total cover a large region of the full 16S rRNA sequence (~1.5 kb). In a typical application, this approach rapidly generates very large numbers of 16S rRNA datasets that can overwhelm manual processing efforts, leading to both delays and errors. In particular, the approach presents two computational challenges: (1) the assembly of a composite sequence from the two partial reads and (2) the subsequent appropriate identification of the organism represented by the newly sequenced clones. Herein, we describe a software package, search, trim, identify, track, and capture the uniqueness of 16S rRNAs using public and in-house database (STITCH), which offers automated sequence pair splicing and genetic identification, thus simplifying the computationally intensive analysis of large sequencing libraries. The STITCH software is freely accessible over the Internet at: http://prion.bchs.uh.edu/stitch/.
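The first challenge named in the abstract, assembling a composite sequence from two overlapping partial reads, can be illustrated with a minimal sketch. It is not the STITCH implementation: the function names and the `min_overlap` parameter are hypothetical, and the sketch assumes the two reads agree exactly across their overlap, which real reads containing sequencing errors would not guarantee.

```python
# Illustrative sketch only; names and parameters are hypothetical,
# not taken from the STITCH software.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def splice_pair(forward, reverse, min_overlap=20):
    """Merge a forward read with a reverse read into one composite.

    The reverse read is reverse-complemented so that both reads are on
    the same strand, then the longest exact match between the end of
    the forward read and the start of the reoriented reverse read is
    used as the join. Returns None if no overlap of at least
    min_overlap bases is found (assumption: exact matching only).
    """
    reoriented = reverse_complement(reverse)
    longest = min(len(forward), len(reoriented))
    for k in range(longest, min_overlap - 1, -1):
        if forward[-k:] == reoriented[:k]:
            return forward + reoriented[k:]
    return None
```

A production tool would more plausibly use a pairwise alignment that tolerates mismatches in the overlap and would trim low-quality read ends first; the exact-match search here only shows the shape of the problem.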
Electronic supplementary material The online version of this article (doi:10.1007/s00248-010-9779-2) contains supplementary material, which is available to authorized users.

D. Zhu : G. E. Fox
Department of Biology and Biochemistry, University of Houston, Houston, TX 77204-5001, USA

P. A. Vaishampayan (*) : K. Venkateswaran
Biotechnology and Planetary Protection Group, California Institute of Technology, Jet Propulsion Laboratory, M/S 89-102, 4800 Oak Grove Dr, Pasadena, CA 91109, USA
e-mail: vaishamp@jpl.nasa.gov

Present Address: D. Zhu
Human Genome Sequencing Center, Baylor College of Medicine, Room Alkek N1619, Houston, TX 77030, USA

Microb Ecol (2011) 61:669–675
DOI 10.1007/s00248-010-9779-2

Introduction

To infer the genetic affinity of a newly sequenced 16S rRNA gene, it has become a routine procedure to search for its closest relatives against public databases such as those of the National Center for Biotechnology Information (NCBI) [1, 2], the Ribosomal Database Project (RDP) [3], and SILVA [4]. Though genetic relationships can be assessed by comparing partial (~700 bp) 16S rRNA gene sequences [5, 6], comparison of nearly complete 16S rRNAs (~1.5 kb) is now widely preferred [7]. In theory, ribosomal genes (either 16S or 18S) from all cells present in a given environment, irrespective of cultivability and inclusive of novel taxa, are initially amplified by polymerase chain reaction (PCR) primers and then shunted into genetically amenable laboratory strains of Escherichia coli via suitable vectors [8–10]. The clones are then sequenced with Sanger sequencing methods that can read hundreds if not thousands of nucleotides [5, 11]. With the emergence of NexGen sequencing platforms, strategies to sequence and characterize environmental samples at an even deeper level are becoming attractive options. The Sanger sequencing approach is still relevant, however, because the NexGen techniques lead to an overestimation of gene and taxon abundance and artificially inflate diversity estimates due to sequencing