Genome-wide Search for Coaxial Helical Stacking Motifs
Kevin Byron, Jason T. L. Wang, Dongrong Wen
Bioinformatics Program and Department of Computer Science
New Jersey Institute of Technology
Newark, New Jersey 07102, USA
{byron, wangj, dw39}@njit.edu
Abstract- Motif finding in DNA, RNA and proteins plays an
important role in life science research. In this paper, we
present a computational approach to searching for RNA
tertiary motifs in genomic sequences. Specifically, we describe
a method, named CSminer, and show, as a case study, the
application of CSminer to genome-wide search for coaxial
helical stackings in RNA 3-way junctions. A coaxial helical
stacking motif occurs in an RNA 3-way junction where two
separate helical elements form a pseudocontiguous helix and
provide thermodynamic stability to the RNA molecule as a
whole. Experimental results demonstrate the effectiveness of
our approach.
Keywords- coaxial helical stacking, genome-wide motif finding,
RNA junction
I. INTRODUCTION
Motif finding in DNA, RNA and proteins plays an
important role in life science research. Here, we present a
method, named CSminer (i.e. Coaxial helical Stacking
miner), for finding coaxial helical stackings in genomes. A
coaxial helical stacking occurs in an RNA tertiary structure
where two separate helical elements form a
pseudocontiguous helix [1]. Coaxial helical stacking motifs
occur in several large RNA structures, including tRNA [2],
pseudoknots [3], group II introns [4] and large ribosomal
subunits [5][6][7]. Coaxial helical stackings provide
thermodynamic stability to the molecule as a whole [8][9],
and reduce the separation between loop regions within
junctions [10]. Moreover, coaxial helical stacking
interactions form cooperatively with long-range interactions
in many RNAs [11] and are thus essential features that
distinguish different junction topologies.
Non-coding RNA is an active area of research. An
unexpected preliminary result of the human
ENCODE project indicates that whereas protein-coding
sequences (i.e. coding RNA) occupy less than 2% of the
human genome, close to 93% of the genome is transcribed
into non-coding RNA [12]. The “RNA World” hypothesis
proposes that life based on RNA pre-dates the current world
of life based on DNA, RNA and proteins [13]. Specialized
RNA literature continually emerges [14]. The function of
RNA is believed to be closely associated with its 3D
structure, which, by virtue of canonical Watson-Crick base
pairings (i.e. AU, GC) and wobble base pairing (i.e. GU), is
largely determined by its secondary structure [15][16][17].
Many secondary structure prediction tools are available. One
of the more highly regarded is Infernal [18], which has been,
and continues to be, frequently cited [19][20]. Infernal
applies stochastic context-free grammar
methodology to efficiently predict (non-coding) RNA
secondary structures in genome-wide searches [21][22][23].
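The base-pairing rules named earlier in this paragraph (the canonical AU and GC pairs and the GU wobble pair) can be illustrated with a short sketch. It is not part of Infernal or CSminer; the helper names `pair_type` and `count_helix_pairs` are hypothetical and introduced only for illustration.

```python
# Canonical Watson-Crick pairs (AU, GC) and the GU wobble pair, as named in the text.
CANONICAL_PAIRS = {frozenset("AU"), frozenset("GC")}
WOBBLE_PAIRS = {frozenset("GU")}


def pair_type(base_a: str, base_b: str) -> str:
    """Return 'canonical', 'wobble' or 'none' for two RNA bases (hypothetical helper)."""
    pair = frozenset((base_a.upper(), base_b.upper()))
    if pair in CANONICAL_PAIRS:
        return "canonical"
    if pair in WOBBLE_PAIRS:
        return "wobble"
    return "none"


def count_helix_pairs(strand_a: str, strand_b: str) -> dict:
    """Count pair types between two equal-length strands of a helix.

    Both strands are written 5'->3'. Helical strands are antiparallel,
    so strand_b is reversed before pairing position by position.
    """
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must have equal length")
    counts = {"canonical": 0, "wobble": 0, "none": 0}
    for a, b in zip(strand_a, reversed(strand_b)):
        counts[pair_type(a, b)] += 1
    return counts
```

For example, `count_helix_pairs("GGAU", "GUCC")` pairs G-C, G-C, A-U and U-G, giving three canonical pairs and one wobble pair.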
Databases detailing the 3D structure and features of RNA
continue to grow [24][25]. Special interest is paid to RNA
junctions [26][27] in which there are one or more coaxial
helical stackings [28][29]. Statistical analysis approaches, in
particular ensemble-based approaches, have been successful
in applications outside the life sciences [30][31]. More
recently, ensemble-based approaches have also been
successful in bioinformatics [32][33][34][35][36]. In [1], we
applied an ensemble-based approach, namely random forests,
to predict the existence of a coaxial helical stacking in RNA
junctions. In this paper, we extend the functionality of Infernal to
create a tool, named CSminer, which can efficiently predict
the existence of coaxial helical stackings in genomes. This is
accomplished by invoking a random forests classifier within
Infernal and filtering Infernal results appropriately. Changes
to the Infernal source code are available from the authors
upon request.
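The idea of filtering search results with a classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' modification of Infernal: the record type `CandidateHit`, its fields, the function `filter_hits` and the stand-in `toy_classifier` are all hypothetical, and the sketch filters results after the search rather than inside it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CandidateHit:
    """Hypothetical record for one candidate junction reported by a structure search."""
    seq_name: str
    start: int
    end: int
    score: float
    features: List[float]  # hypothetical junction features passed to the classifier


def filter_hits(
    hits: Iterable[CandidateHit],
    predict_status: Callable[[List[float]], str],
    min_score: float = 0.0,
) -> List[CandidateHit]:
    """Keep hits that score at least min_score and are predicted to contain a stack."""
    kept = []
    for hit in hits:
        if hit.score < min_score:
            continue  # discard weak search hits before classification
        if predict_status(hit.features) != "none":
            kept.append(hit)
    return kept


def toy_classifier(features: List[float]) -> str:
    """Stand-in for a trained random forest; the rule below is purely illustrative."""
    return "H1H2" if features and features[0] > 0.5 else "none"
```

Passing the classifier as a callable keeps the filtering step independent of how the model was trained, so a real random forest could replace `toy_classifier` without changing `filter_hits`.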
II. MATERIAL AND METHODS
A. RNA 3-Way Junctions
For this work, we selected samples from known RNA
junctions. RNA junctions occur in multiple forms [37]. As a
case study, we focus on 3-way junctions
here. In [1], we studied 110 distinct RNA 3-way junctions
confirmed in available crystal structures. Each 3-way
junction contains a multi-branch loop (MBL) with three
helices. Each of these 110 unique junctions is verified in one
of 32 crystal structure molecules in the PDB [24]. The majority,
75%, of these 110 3-way junctions are found in the relatively
complex ribosome subunit molecules, i.e. 51% in 23S rRNA,
20% in 16S rRNA and 4% in 5S rRNA. There is no
dominant topological configuration among these 110 3-way
junctions in that 47% are categorized as family type C, 35%
as family type A and the remaining 18% as family type B
[1]. For each of these 110 3-way junctions, the coaxial
helical stacking status is known, and the status is one of these
four possibilities: H1H2, H1H3, H2H3 or none, where HxHy
indicates that helix Hx shares a common axis with helix Hy.
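The four status labels and the percentages quoted above can be captured in a small sketch. The class `StackingStatus` and its methods are hypothetical names chosen for illustration; only the labels and percentages come from the text.

```python
from enum import Enum
from typing import Optional, Tuple


class StackingStatus(Enum):
    """Coaxial helical stacking status of a 3-way junction (labels from the text)."""
    H1H2 = "H1H2"
    H1H3 = "H1H3"
    H2H3 = "H2H3"
    NONE = "none"

    def stacked_helices(self) -> Optional[Tuple[int, int]]:
        """Return the indices (x, y) of the helices sharing an axis, or None."""
        if self is StackingStatus.NONE:
            return None
        return int(self.value[1]), int(self.value[3])

    def unstacked_helix(self) -> Optional[int]:
        """Return the index of the helix left out of the stack, or None."""
        pair = self.stacked_helices()
        if pair is None:
            return None
        return ({1, 2, 3} - set(pair)).pop()


# Percentages of the 110 junctions as reported in the text.
RRNA_SHARE = {"23S rRNA": 51, "16S rRNA": 20, "5S rRNA": 4}
FAMILY_SHARE = {"C": 47, "A": 35, "B": 18}

# The rRNA shares add up to the quoted 75% majority; the families cover all junctions.
assert sum(RRNA_SHARE.values()) == 75
assert sum(FAMILY_SHARE.values()) == 100
```

For instance, `StackingStatus("H1H3").stacked_helices()` yields `(1, 3)`, and the helix left unstacked is helix 2.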
Following [1], a 3-way junction is described by three
RNA subsequences. For each subsequence, base coordinates
and base values (i.e. A, C, G, U) are known. The starting
and ending coordinates of each subsequence indicate the 5’
Proceedings of the 2012 IEEE 12th International Conference on Bioinformatics
& Bioengineering (BIBE), Larnaca, Cyprus, 11-13 November 2012
978-1-4673-4358-9/12/$31.00 ©2012 IEEE