MicroRNA Target Recognition from Compositional Features of Aligned microRNA-mRNA Duplexes

Hasan Oğul 1, Çiğdem Beyan 1, Öykü Eren Özsoy 1, Kerem Yıldız 1, Tülin Erçelebi Ayyıldız 1, Arzu Burçak Sönmez 2

1 Department of Computer Engineering, Başkent University, Eskişehir Road 20.km, Bağlıca Campus, 06810, Ankara
2 Department of Computer Engineering, Çankaya University, Öğretmenler Street, 06530, Balgat, Ankara

{hogul,cbeyan,oeren,kyildiz,ercelebi}@baskent.edu.tr, aburcak@cankaya.edu.tr

Abstract

The discovery of microRNAs (miRNAs), small non-coding RNAs that regulate gene expression at the post-transcriptional level, has led to a shift in our understanding of the complexity of gene regulatory networks. A key step in understanding the effect of an individual miRNA is to identify its potential targets in a local or genome-wide regulation process. We define here a novel algorithm for computational prediction of miRNA target genes using solely the sequence information of the mature miRNA and the potential target mRNAs. Our method treats the aligned miRNA-mRNA duplex as a new sequence and extracts compositional features from this newly constructed sequence to be fed into a Naive Bayes classifier. A rigorous analysis on a common benchmark set reveals that, in terms of prediction accuracy, the new algorithm outperforms most of the available methods, while being competitive with more complex and relatively inefficient methods.

1. Introduction

MicroRNAs (miRNAs) are small, stable, ~22-nucleotide non-coding RNA molecules that are found in bacteria, animals and plants. They play a role in post-transcriptional gene regulation and have distinct mechanisms to down-regulate gene expression. Moreover, they play crucial roles in cellular processes including cell proliferation [1], apoptosis [2], development [3], differentiation [4], and metabolism [5]. Hence, identifying the targets of miRNAs is an important step towards understanding miRNA functions.
To date, many methods have been developed for the prediction of miRNAs and their target genes. The majority of these methods are based on sequence alignment or on the minimum free energy of the hybridization [6]. Partial complementarity between the miRNA's 5' end (a seed of length 6-8 nt) and the 3' untranslated region (3' UTR) of the target mRNA, thermodynamic duplex stability, multiple target sites in the 3' UTR of the target mRNA, and evolutionarily conserved target sites are also parameters of many target prediction methods [7]. In addition, [8] proposes using the miRNA's out-seed segment to determine targets.

TargetScan [9] and PicTar [10] are examples of miRNA target prediction methods based on sequence complementarity to target sites with base-pairing in the seed region. TargetScan [9] focuses in general on aligned seed matches and their conservation. The score of an mRNA is calculated according to the level of conservation, the binding position, and the distance from the 3' UTR end. PicTar [10] is a computational method that identifies miRNA-target interactions by complementarity between the seed region of the miRNA and the 3' UTR target site; it also considers bulges, mismatches, and G:U wobble pairs. A Hidden Markov Model is used to score the likelihood that a 3' UTR is a target.

RNAHybrid [11], on the other hand, is based on calculations of mRNA secondary structure and energetically favorable hybridization between the miRNA and the target mRNA. miRanda [7] also finds miRNA:mRNA duplexes, focusing on alignment scores, conserved miRNA target sites, and the free energy of the duplex. Similar to PicTar [10], miRanda [7] considers mismatches, G:U wobble pairs, and matching position when computing sequence complementarity scores.

Another miRNA:mRNA duplex detection approach is RNA22 [12]. As a pattern-based approach, RNA22 uses known miRNA sequences to obtain sequence features of miRNA binding sites.
These features are represented by patterns, and the reverse complements of these patterns are used to detect miRNA:mRNA duplexes. Unlike TargetScan [9] and PicTar [10], RNA22 does not depend on sequence conservation.
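
To make the notion of seed complementarity discussed above concrete, the following is a minimal Python sketch of a seed-site scan over a 3' UTR. It is an illustration only and not the implementation of any of the cited tools: the function name seed_sites, the choice of nucleotides 2-8 as the seed, and the optional allow_wobble switch are assumptions introduced here. Both sequences are assumed to be RNA strings written 5' to 3'.

```python
# Illustrative sketch only: not the scoring scheme of TargetScan, PicTar, or miRanda.
# Base pairs are written as (miRNA base, mRNA base).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def seed_sites(mirna, utr, seed_start=1, seed_len=7, allow_wobble=False):
    """Return 0-based UTR positions where the miRNA seed can pair with the UTR.

    The seed is mirna[seed_start:seed_start + seed_len]; the defaults select
    nucleotides 2-8 (an assumption made for this example).
    """
    allowed = WATSON_CRICK | WOBBLE if allow_wobble else WATSON_CRICK
    seed = mirna[seed_start:seed_start + seed_len]
    hits = []
    for i in range(len(utr) - seed_len + 1):
        window = utr[i:i + seed_len]
        # The duplex is antiparallel, so the first seed base pairs with the
        # last base of the UTR window, the second with the second-to-last, etc.
        if all((s, t) in allowed for s, t in zip(seed, reversed(window))):
            hits.append(i)
    return hits


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"
    print(seed_sites(let7a, "AAACUACCUCAAA"))
    print(seed_sites(let7a, "AAACUAUCUCAAA", allow_wobble=True))
```

Perfect seed matching of this kind is only the starting point of the methods above, which additionally weigh conservation, site position, or duplex free energy; none of those terms is modeled in this sketch.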