Proceedings of the 8th International Workshop on Tree Adjoining Grammar and Related Formalisms, pages 57–64, Sydney, July 2006. © 2006 Association for Computational Linguistics

Stochastic Multiple Context-Free Grammar for RNA Pseudoknot Modeling

Yuki Kato
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0192, Japan
yuuki-ka@is.naist.jp

Hiroyuki Seki
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0192, Japan
seki@is.naist.jp

Tadao Kasami
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0192, Japan
kasami@naist.jp

Abstract

Several grammars have been proposed for modeling RNA pseudoknotted structure. In this paper, we focus on multiple context-free grammars (MCFGs), which are a natural extension of context-free grammars and can represent pseudoknots, and extend a specific subclass of MCFGs to a probabilistic model called SMCFG. We present a polynomial-time parsing algorithm for finding the most probable derivation tree and a probability parameter estimation algorithm. Furthermore, we show some experimental results of pseudoknot prediction using the SMCFG algorithm.

1 Introduction

Non-coding RNAs fold into characteristic structures determined by interactions between mostly Watson-Crick complementary base pairs. Such a base-paired structure is called the secondary structure. The pseudoknot (Figure 1 (a)) is one of the typical substructures found in the secondary structures of several RNAs, including rRNAs, tmRNAs and viral RNAs. An alternative graphic representation of a pseudoknot is the arc depiction, in which arcs connect base pairs (Figure 1 (b)). It has been recognized that pseudoknots play an important role in RNA functions such as ribosomal frameshifting and regulation of translation.
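To make the arc depiction concrete, the following minimal Python sketch (an illustration, not part of the original paper) represents a secondary structure as a list of base-pair positions and checks whether any two arcs cross, which is the property that distinguishes a pseudoknot from a purely nested structure. The function names `crosses` and `has_pseudoknot` and the two toy structures are hypothetical and chosen only for this example; they do not reproduce the sequence in Figure 1.

```python
from itertools import combinations


def crosses(p, q):
    """Return True if arcs p = (i, j) and q = (k, l) cross.

    After ordering each arc's endpoints and ordering the two arcs by
    their left endpoint, the arcs cross exactly when i < k < j < l.
    """
    (i, j), (k, l) = sorted((tuple(sorted(p)), tuple(sorted(q))))
    return i < k < j < l


def has_pseudoknot(pairs):
    """Return True if any two base pairs in the structure cross."""
    return any(crosses(p, q) for p, q in combinations(pairs, 2))


# Hypothetical toy structures (0-based sequence positions).
nested = [(0, 9), (1, 8), (2, 7)]             # a single stem: arcs are nested
knotted = [(0, 8), (1, 7), (4, 12), (5, 11)]  # two interleaved stems

print(has_pseudoknot(nested))   # False
print(has_pseudoknot(knotted))  # True
```

The pairwise check is quadratic in the number of base pairs, which is adequate for illustration; it only detects crossing arcs and says nothing about how such a structure would be parsed or predicted.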
Many attempts have so far been made at modeling RNA secondary structure by formal grammars. In a grammatical approach, secondary structure prediction can be viewed as a parsing problem. However, there may be many different derivation trees for an input sequence. Thus, it is necessary to have a method of extracting biologically realistic derivation trees among them. One solution to this problem is to extend a grammar to a probabilistic model and find the most likely derivation tree, and another is to take free energy minimization into account.

[Figure 1: Example of RNA secondary structure. (a) Pseudoknot. (b) Arc depiction of (a).]

Eddy and Durbin (1994) and Sakakibara et al. (1994) modeled RNA secondary structure without pseudoknots by using stochastic context-free grammars (stochastic CFGs or SCFGs). For pseudoknotted structures (Figure 1 (a)), however, another approach has to be taken, since a single CFG lacks the generative power to represent the crossing dependencies of base pairs in pseudoknots (Figure 1 (b)). Brown and Wilson (1996) proposed a model based on intersections of SCFGs to describe RNA pseudoknots. Cai et al. (2003) introduced a model based on parallel communication grammar systems using a single CFG synchronized with a number of regular grammars. Akutsu (2000) provided dynamic programming algorithms for RNA pseudoknot prediction without using grammars. On the other hand, several grammars have been proposed where the grammar itself can fully describe pseudoknots. Rivas and Eddy (1999, 2000) provided a dynamic programming