Genome-wide Search for Coaxial Helical Stacking Motifs

Kevin Byron, Jason T. L. Wang, Dongrong Wen
Bioinformatics Program and Department of Computer Science
New Jersey Institute of Technology
Newark, New Jersey 07102, USA
{byron, wangj, dw39}@njit.edu

Abstract- Motif finding in DNA, RNA and proteins plays an important role in life science research. In this paper, we present a computational approach to searching for RNA tertiary motifs in genomic sequences. Specifically, we describe a method, named CSminer, and show, as a case study, the application of CSminer to genome-wide search for coaxial helical stackings in RNA 3-way junctions. A coaxial helical stacking motif occurs in an RNA 3-way junction where two separate helical elements form a pseudocontiguous helix and provide thermodynamic stability to the RNA molecule as a whole. Experimental results demonstrate the effectiveness of our approach.

Keywords- coaxial helical stacking, genome-wide motif finding, RNA junction

I. INTRODUCTION

Motif finding in DNA, RNA and proteins plays an important role in life science research. Here, we present a method, named CSminer (i.e., Coaxial helical Stacking miner), for finding coaxial helical stackings in genomes. A coaxial helical stacking occurs in an RNA tertiary structure where two separate helical elements form a pseudocontiguous helix [1]. Coaxial helical stacking motifs occur in several large RNA structures, including tRNA [2], pseudoknots [3], group II introns [4] and large ribosomal subunits [5][6][7]. Coaxial helical stackings provide thermodynamic stability to the molecule as a whole [8][9], and reduce the separation between loop regions within junctions [10]. Moreover, coaxial helical stacking interactions form cooperatively with long-range interactions in many RNAs [11] and are thus essential features that distinguish different junction topologies.

Research to unravel the mysteries of (non-coding) RNA is exciting.
An unexpected preliminary result of the human ENCODE project indicates that whereas protein-coding sequences (i.e., coding RNA) occupy less than 2% of the human genome, close to 93% of the genome is transcribed into non-coding RNA [12]. The “RNA World” hypothesis proposes that life based on RNA pre-dates the current world of life based on DNA, RNA and proteins [13]. Specialized RNA literature continually emerges [14].

The function of RNA is believed to be closely associated with its 3D structure, which, by virtue of canonical Watson-Crick base pairings (i.e., AU, GC) and wobble base pairing (i.e., GU), is largely determined by its secondary structure [15][16][17]. Many secondary structure prediction tools are available. One of the more highly regarded of these tools is Infernal [18], which has been, and continues to be, frequently cited [19][20]. Infernal applies stochastic context-free grammar methodology to efficiently predict (non-coding) RNA secondary structures in genome-wide searches [21][22][23].

Databases detailing the 3D structure and features of RNA continue to grow [24][25]. Special interest is paid to RNA junctions [26][27] in which there are one or more coaxial helical stackings [28][29].

Statistical analysis approaches, in particular ensemble-based approaches, have been successful in non-life science applications [30][31]. Recently, these ensemble-based approaches have also been successful in the field of bioinformatics [32][33][34][35][36]. We apply an ensemble-based approach, namely random forests, to predict the existence of a coaxial helical stacking in RNA junctions [1].

In this paper, we extend the functionality of Infernal to create a tool, named CSminer, which can efficiently predict the existence of coaxial helical stackings in genomes. This is accomplished by invoking a random forests classifier within Infernal and filtering Infernal results appropriately. Changes to the Infernal source code are available from the authors upon request.

II. MATERIAL AND METHODS

A. RNA 3-Way Junctions

For this work, we selected samples from known RNA junctions. There are multiple ways for an RNA junction to exist [37]. As a case study, we focus on 3-way junctions here. In [1], we studied 110 distinct RNA 3-way junctions confirmed in available crystal structures. Each 3-way junction contains a multi-branch loop (MBL) with three helices. Each of these 110 unique junctions is verified in one of 32 crystal structure molecules in the PDB [24]. The majority, 75%, of these 110 3-way junctions are found in the relatively complex ribosomal subunit molecules: 51% in 23S rRNA, 20% in 16S rRNA and 4% in 5S rRNA. No single topological configuration dominates among these 110 3-way junctions: 47% are categorized as family type C, 35% as family type A and the remaining 18% as family type B [1]. For each of these 110 3-way junctions, the coaxial helical stacking status is known and is one of four possibilities: H1H2, H1H3, H2H3 or none, where HxHy indicates that helix Hx shares a common axis with helix Hy.

Following [1], a 3-way junction is described by three RNA subsequences. For each subsequence, base coordinates and base values (i.e., A, C, G, U) are known. The starting and ending coordinates of each subsequence indicate the 5’

Proceedings of the 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE), Larnaca, Cyprus, 11-13 November 2012 978-1-4673-4358-9/12/$31.00 ©2012 IEEE 260
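
As an illustration of the junction description in Section II.A (three subsequences with known base coordinates and base values, plus one of four coaxial helical stacking statuses), one possible in-memory representation is sketched below. This is only a sketch under stated assumptions: the names `Strand`, `ThreeWayJunction` and `stacked_helices`, the 1-based inclusive coordinate convention, and the example coordinates and sequences are all hypothetical and are not taken from CSminer, Infernal or [1].

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The four coaxial helical stacking statuses named in the text.
STACKING_STATUSES = ("H1H2", "H1H3", "H2H3", "none")
RNA_BASES = frozenset("ACGU")


@dataclass(frozen=True)
class Strand:
    """One RNA subsequence of a junction (hypothetical representation)."""
    start: int   # coordinate of the first base (assumed 1-based, inclusive)
    end: int     # coordinate of the last base (assumed inclusive)
    bases: str   # base values, one character per coordinate

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("end coordinate precedes start coordinate")
        if len(self.bases) != self.end - self.start + 1:
            raise ValueError("number of bases does not match the coordinates")
        if not set(self.bases) <= RNA_BASES:
            raise ValueError("bases must be drawn from A, C, G, U")


@dataclass(frozen=True)
class ThreeWayJunction:
    """A 3-way junction: three subsequences plus a stacking status."""
    strands: Tuple[Strand, Strand, Strand]
    status: str  # one of STACKING_STATUSES

    def __post_init__(self):
        if len(self.strands) != 3:
            raise ValueError("a 3-way junction needs exactly three strands")
        if self.status not in STACKING_STATUSES:
            raise ValueError(f"unknown stacking status: {self.status}")

    def stacked_helices(self) -> Optional[Tuple[str, str]]:
        """Return the pair of helices sharing a common axis, or None."""
        if self.status == "none":
            return None
        return (self.status[:2], self.status[2:])


# Placeholder values for illustration only; not real junction data.
junction = ThreeWayJunction(
    strands=(
        Strand(10, 15, "GGCAUC"),
        Strand(30, 34, "GAUGC"),
        Strand(50, 55, "GCAAGC"),
    ),
    status="H1H2",
)
print(junction.stacked_helices())  # ('H1', 'H2')
```

Validating coordinates and base values at construction time keeps malformed junction records from reaching any later classification step; whether the original tool performs such checks is not stated in the text.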