RNAMAT: An Efficient Method to Detect Classes of RNA Molecules and their Structural Features

Yair Horesh *, Amihood Amir *, Shulamit Michaeli +, Ron Unger +
* Department of Computer Science and + Faculty of Life Science, Bar-Ilan University, Ramat-Gan 52900, Israel

Abstract: There is a growing appreciation for the diverse and important roles RNA molecules play in cellular function. RNAMAT is an approach that uses a matrix representation of all potential base pairings in a set of sequences to reveal common secondary-structure features. When the RNA sequences come from one class, proper summation of these matrices exposes common structural features, as demonstrated for tRNA and H/ACA-RNA. For C/D-RNA, a novel structural motif is suggested. Furthermore, it is demonstrated in the case of tmRNA that the method can detect pseudo-knots, structural motifs that are difficult to detect with other methods. When the sequences come from diverse sources, a specific clustering algorithm is suggested that is capable of detecting the common motifs. The algorithm is demonstrated on a simulated example and on a real case derived from a comparative RNomics study of Trypanosomes.

Keywords: Clustering, RNA folding, Pseudo-knots, Dotplot

I. INTRODUCTION

In recent years it has become clear that, in addition to their fundamental role in translating DNA into proteins (tRNA, mRNA, rRNA), RNA molecules play significant roles in diverse cellular processes such as ribosomal RNA maturation and modification (snoRNA); replication (telomerase RNA); editing (RNA editing, e.g. of the serotonin receptor); protein translocation (SRP RNA); translation quality control in prokaryotes (tmRNA); gene silencing (miRNA); and more. For a recent review see [1]. The realization of how important RNA molecules are in cellular processes [2] is the motivation behind recent efforts in RNomics, the systematic study aimed at identifying all the RNA molecules utilized by an organism [3].
However, RNomics is much more difficult to study than genomics. For coding genes (ORFs), high-throughput experimental methods combined with bioinformatic methods have been successfully developed to identify the sequence, the genomic location, the structure and the function of genes. There has been much less progress in identifying and classifying non-coding DNA that produces functional RNA molecules. The experimental approach must contend with RNA molecules that are short, short-lived, expressed in small quantities, susceptible to degradation during experimental procedures, and often expressed only in specific tissues or developmental stages. The computational methods suffer from the fact that RNA molecules do not carry the signals (such as a TATA box, promoters, ORFs between start and stop codons, codon periodicity, etc.) that are helpful in identifying coding genes.

The two main features that enable detection of RNA structures are the conservation of short stretches (say, of less than 150 bp, too short to code for proteins) between related species (comparative RNomics), and the ability of RNA molecules to form secondary structure as a single strand. RNA secondary structure consists of stems formed by complementary base pairing of inverted repeats; see Fig. 1e for an example of the secondary structure of tRNA. In many classes of RNA molecules, the level of sequence similarity is low, and the similarity lies in the conserved pattern of the secondary-structure elements.

For known classes of RNA, algorithms are available to scan genomes effectively (e.g. tRNA-scan [4]) and identify sequences that belong to that class. When the structure is unknown, the problem is more difficult. Most current RNA structure prediction methods are based on energy calculations that aim to find the "optimal" secondary structure for a given sequence. The original algorithm [5] maximizes the number of complementary base pairs.
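The base-pair maximization idea can be illustrated with a minimal Python sketch of a classic nested-pairs dynamic program. It is not claimed to reproduce the algorithm of [5] in detail: the function name `max_base_pairs`, the inclusion of G-U wobble pairs, and the minimum hairpin loop length `min_loop` are assumptions made here for illustration only.

```python
# Assumed pairing rules: Watson-Crick pairs plus G-U wobble (an illustrative choice).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested (pseudo-knot-free) base pairs.

    dp[i][j] holds the best count for the subsequence seq[i..j].
    Three nested loops over (span, i, k) give cubic running time.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1]
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    # A short hairpin: three G-C pairs closing an AAAA loop.
    print(max_base_pairs("GGGAAAACCC"))
```

Because the recursion only ever combines an interval to the left of a pair with the interval enclosed by it, crossing pairs (pseudo-knots) can never be counted, which is the restriction discussed in the text.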
The dynamic programming algorithm, assuming that all base pairs are nested (no pseudo-knots are allowed), runs in time cubic in the sequence length. More elaborate algorithms (notably the pioneering package Mfold [6] and the Vienna package [7]) try to minimize the free energy of the structure, using empirical parameters to evaluate the energetic contributions of the different base pairings. However, these predictions are not very reliable, especially for short sequences, where several very different structures are suggested with quite similar energy scores. For example, in a test that we performed (data not shown), fewer than 25% of human tRNA molecules were folded by Mfold into the well-known clover-leaf structure.

Fig. 1. tRNA. Top: three individual matrices of tRNA molecules; no common features can be identified. Bottom: (d) 196 tRNA sequences were summed into an accumulated matrix, and the major peaks are shown. (e) From the accumulated matrix it is simple to reconstruct the clover-leaf structure of tRNA.
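The idea behind the accumulated matrix of Fig. 1 can be sketched in a few lines of Python. This is only a naive illustration and not the RNAMAT procedure itself: it assumes equal-length sequences compared position by position, a unit score for every potential pair, and G-U wobble pairs in addition to Watson-Crick pairs. The names `pairing_matrix`, `accumulate` and `major_peaks` are hypothetical.

```python
# Assumed pairing rules: Watson-Crick pairs plus G-U wobble (an illustrative choice).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pairing_matrix(seq):
    """Upper-triangular 0/1 matrix marking every potential base pair (i < j).

    No minimum loop length is enforced, so adjacent bases may be marked too.
    """
    n = len(seq)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if (seq[i], seq[j]) in PAIRS:
                m[i][j] = 1
    return m


def accumulate(seqs):
    """Sum the pairing matrices of equal-length sequences cell by cell."""
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("this naive sketch assumes equal-length sequences")
    total = [[0] * n for _ in range(n)]
    for s in seqs:
        m = pairing_matrix(s)
        for i in range(n):
            for j in range(n):
                total[i][j] += m[i][j]
    return total


def major_peaks(total, min_count):
    """List (i, j, count) for cells supported by at least min_count sequences."""
    return [(i, j, c)
            for i, row in enumerate(total)
            for j, c in enumerate(row)
            if c >= min_count]


if __name__ == "__main__":
    seqs = ["GAAAC", "GAAAU", "CAAAG"]  # toy data, not real tRNA sequences
    print(major_peaks(accumulate(seqs), min_count=3))
```

In the toy input the pair between the first and last positions is possible in all three sequences even though the bases differ, so it survives the threshold, while pairings that are possible in only one sequence are filtered out. This mirrors the caption's point that features invisible in individual matrices can emerge after summation.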