CORPUS BASED LEARNING OF STOCHASTIC CONTEXT-FREE GRAMMAR COMBINED WITH HIDDEN MARKOV MODELS FOR TRNA MODELLING

Juan Miguel García-Gómez 1, Jose Miguel Benedí 2
1 Informática Médica-BET, Univ. Politécnica de Valencia
2 Dpto. Sistemas Informáticos y Computación, Univ. Politécnica de Valencia
Camino de Vera s/n, 46022, Valencia, SPAIN
{juanmig@upvnet, jbenedi@dsic}.upv.es

ABSTRACT

The tRNA molecule has a well-known secondary structure, into which it folds by pairing distant nucleotides. This paper presents a Syntactic Pattern Recognition methodology for modelling tRNA secondary structure using stochastic context-free grammars. To learn the models, the structural regions (paired nucleotides) are learned from categorized samples with fully labelled trees, using a Corpus-based estimation algorithm. The non-structural regions are modelled by hidden Markov models, which are transformed into stochastic regular grammars so that they can be merged with the structural regions. Tests with positive and negative samples, in comparison with Sakakibara, achieved a sequence error rate (SER) of 1.81%, 98.43% Precision and 100% Recall, and 100% SER on the negative test. The Corpus-based algorithm is computationally efficient and requires fewer training samples to converge to the correct model of the tRNA secondary structure.

Keywords- grammatical inference, language modelling, RNA, stochastic context-free grammars, syntactic pattern recognition

1. INTRODUCTION

tRNA molecules are encoded by a linear string (the primary structure) of four constituent nucleotides: adenine (A), cytosine (C), guanine (G) and uracil (U). Nucleotides at distant positions interact, forming the A-U and G-C Watson-Crick pairs as well as G-U base pairs. Structured regions are composed of paired nucleotides and are responsible for the folding of the molecule. To model structured regions, we propose to use stochastic context-free grammars (SCFG).
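As an illustration of why paired regions call for context-free rules, the pairing constraints above can be read as productions of the form S -> x S y, where x and y are complementary nucleotides. The following is only a minimal sketch under stated assumptions: the rule probabilities, the `CLOSE_STEM` rule, the minimum loop length and the greedy end-to-end peeling are all hypothetical and are not the grammar or the parser used in this paper. The loop content itself is left unscored.

```python
from typing import Tuple

# Hypothetical probabilities for the rules S -> x S y (one per allowed pair).
# The values are invented for illustration; together with CLOSE_STEM they sum to 1.
PAIR_RULES = {
    ("G", "C"): 0.25, ("C", "G"): 0.25,   # Watson-Crick
    ("A", "U"): 0.15, ("U", "A"): 0.15,   # Watson-Crick
    ("G", "U"): 0.05, ("U", "G"): 0.05,   # G-U pair
}
CLOSE_STEM = 0.10  # hypothetical rule S -> Loop, which ends the stem


def derive_stem(seq: str, min_loop: int = 3) -> Tuple[int, float]:
    """Greedily apply S -> x S y from both ends of seq.

    Returns the number of nested pairs and the probability of that
    derivation (pair rules times the closing rule). The enclosed loop
    is not scored here.
    """
    i, j = 0, len(seq) - 1
    pairs, prob = 0, 1.0
    # Keep pairing while the ends are complementary and at least
    # min_loop unpaired nucleotides would remain inside.
    while j - i - 1 >= min_loop and (seq[i], seq[j]) in PAIR_RULES:
        prob *= PAIR_RULES[(seq[i], seq[j])]
        pairs += 1
        i, j = i + 1, j - 1
    return pairs, prob * CLOSE_STEM


if __name__ == "__main__":
    n_pairs, p = derive_stem("GGCAUUUUGCC")
    print(n_pairs, p)
```

Each application of S -> x S y ties together two positions that may be arbitrarily far apart in the string, which a regular grammar cannot express for nested pairs of unbounded depth.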
Non-structured regions are formed by free nucleotides situated on external loop motifs. To model non-structured regions we propose to use regular grammars; concretely, we learn a model for each region with Hidden Markov Models (HMM). The fusion of the structured and non-structured models yields the general stochastic model for the recognition of tRNA molecules.

Thanks to Diego Linares for answering all the questions about stochastic context-free grammars and to Satoshi Sekine for his evaluation software. The authors thank the Ministerio de Sanidad y Consumo of Spain for its supporting grant and the INBIOMED consortium.

The tRNA molecule has 4 arms and 3 loops: the acceptor, D, T pseudouridine C and anticodon arms, and the D, T pseudouridine C and anticodon loops. Some tRNA molecules also have an extra, or variable, loop.

Grammatical structures and inference algorithms can be applied to the tRNA structure in order to study their behaviour on real problems. In particular, the palindromic structure of the structured regions of tRNA molecules is a very interesting linguistic pattern on which to apply estimation and interpretation algorithms for stochastic context-free grammars (SCFG) [1].

Modelling of tRNA molecules using Syntactic Pattern Recognition has been studied by several authors because of its interesting secondary structure. Salvador and Benedí [2] present experimental results using stochastic context-free grammars combined with n-grams to model the tRNA structure. Combining stochastic context-free grammars for the regions with long-range nucleotide relations and regular grammars for the regions without such relations can yield precise models of tRNA molecules. Appropriate learning algorithms for each region provide the computational mechanism for estimating the model.
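The regular-grammar side of the model (HMMs for the unpaired regions, expressed as stochastic regular grammars so that they can be merged with the SCFG) can be sketched as follows. This is an illustrative construction, not necessarily the transformation used in this paper: it assumes an HMM whose states emit symbols and which has explicit per-state end probabilities (`END`), and all the numbers and names (`PI`, `TRANS`, `EMIT`, `hmm_to_srg`, ...) are hypothetical. Each transition i -> j that emits a becomes a rule Q_i -> a Q_j with probability trans[i][j] * emit[j][a].

```python
from typing import Dict, List, Optional, Tuple

# A rule is (lhs, terminal, rhs); terminal and rhs are None for "lhs -> empty".
Rule = Tuple[object, Optional[str], object]

# Hypothetical 2-state HMM over nucleotides (values invented for illustration).
PI = [0.6, 0.4]                           # initial state distribution
TRANS = [[0.5, 0.3], [0.2, 0.6]]          # state-to-state transitions
END = [0.2, 0.2]                          # assumed end probability per state
EMIT = [
    {"A": 0.4, "C": 0.1, "G": 0.1, "U": 0.4},
    {"A": 0.1, "C": 0.4, "G": 0.4, "U": 0.1},
]


def hmm_to_srg(pi, trans, emit, end) -> Dict[Rule, float]:
    """Build right-linear rules S -> a Q_j, Q_i -> a Q_j and Q_i -> empty."""
    rules: Dict[Rule, float] = {}
    n = len(pi)
    for j in range(n):
        for a, e in emit[j].items():
            rules[("S", a, j)] = pi[j] * e
    for i in range(n):
        for j in range(n):
            for a, e in emit[j].items():
                rules[(i, a, j)] = trans[i][j] * e
        rules[(i, None, None)] = end[i]
    return rules


def srg_probability(rules: Dict[Rule, float], string: str) -> float:
    """Sum the probabilities of all derivations of string in the grammar."""
    current: Dict[object, float] = {"S": 1.0}
    for a in string:
        nxt: Dict[object, float] = {}
        for (lhs, sym, rhs), p in rules.items():
            if sym == a and lhs in current:
                nxt[rhs] = nxt.get(rhs, 0.0) + current[lhs] * p
        current = nxt
    return sum(w * rules.get((lhs, None, None), 0.0) for lhs, w in current.items())


def hmm_forward(pi, trans, emit, end, string: str) -> float:
    """Forward algorithm for the same HMM, used here as a cross-check."""
    n = len(pi)
    if not string:
        return 0.0
    alpha: List[float] = [pi[j] * emit[j].get(string[0], 0.0) for j in range(n)]
    for a in string[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(a, 0.0)
            for j in range(n)
        ]
    return sum(alpha[i] * end[i] for i in range(n))


rules = hmm_to_srg(PI, TRANS, EMIT, END)

if __name__ == "__main__":
    print(srg_probability(rules, "GAU"), hmm_forward(PI, TRANS, EMIT, END, "GAU"))
```

Under these assumptions the grammar and the HMM assign the same probability to every string, so the loop models can be plugged into the SCFG as ordinary nonterminals.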
New experiments using the Corpus-based algorithm have been carried out to model the structural regions, and their combination with Hidden Markov Models of the non-structural regions has achieved satisfactory results in low computation time. We present this methodology in the following sections.

Sakakibara et al. [3] introduce a method for estimating SCFG models of the secondary structure and aligning tRNA sequences. Two approaches are followed: 1) computing the rule probabilities from aligned sequences by counting the nucleotides in each column, using a Dirichlet probability density; and 2) applying the EM algorithm, which (by dynamic programming) obtains the probability of each production using the inside (in) function, the probability of a subtree, and the outside (out) function, the probability of the rest of the tree without the expanded node. In [4], Sakakibara studies the models' capability, alignment and discrimination for seven tRNA