Exploring Essential Attributes For Detecting MicroRNA Precursors From Background Sequences

Yun Zheng, Wynne Hsu, Mong Li Lee and Lim Soon Wong
Department of Computer Science, School of Computing
National University of Singapore, Singapore 117543
{zhengy, whsu, leeml, wongls}@comp.nus.edu.sg

Abstract. MicroRNAs (miRNAs) have been shown to play important roles in post-transcriptional gene regulation. The hairpin structure is a key characteristic of microRNA precursors (pre-miRNAs). Encoding these hairpin structures is a critical step in correctly detecting pre-miRNAs among background sequences, i.e., pseudo miRNA precursors. In this paper, we propose to encode the hairpin structures of pre-miRNAs with a set of features that captures both the global and local structural characteristics of the pre-miRNAs. Furthermore, using an information-theoretic approach, we find that four essential attributes are discriminatory for classifying human pre-miRNAs and background sequences. The experimental results show that the number of conserved essential attributes decreases as the phylogenetic distance between species increases. Specifically, one A-U pair in the pre-miRNAs, which produces the U at the start position of most mature miRNAs, is found to be well conserved across species for the purpose of biogenesis.

1 Introduction

MicroRNAs (miRNAs) are small non-coding RNAs about 22 nucleotides long. A growing body of evidence shows that miRNAs play important roles in gene regulation and various biological processes, as reviewed in [1–3]. MicroRNA transcripts, which may be produced by RNA polymerase II or III [3], often fold to form stem-loop structures and become what are called primary miRNAs, or pri-miRNAs. In the nucleus, the Drosha RNase III endonuclease cleaves both strands of the stem at the base of the primary stem-loop [4], producing the pre-miRNAs.
Then, in the cytoplasm, a second RNase III endonuclease, Dicer, together with its dsRNA-binding partner protein, makes a second pair of cuts that defines the other end of the mature miRNA (see the example in Figure 1) and produces the miRNA:miRNA* duplex. Finally, a helicase separates the miRNA strand from the duplex to form the mature miRNA molecule [2–4]. The mature miRNAs are then loaded into the RNA-induced silencing complex (RISC), which binds the 3′ untranslated regions of the messenger RNAs of the miRNA target genes to repress production of the corresponding proteins [3, 5].