Proceedings of the 13th National Conference on Fundamental and Applied Information Technology Research (FAIR), Nha Trang, October 8-9, 2020
DOI: 10.15625/vap.2020.00143

A NEW METHOD OF RNA SECONDARY STRUCTURE PREDICTION BASED ON GENETICS ALGORITHMS AND MACHINE LEARNING

Doan Duy Binh¹, Pham Minh Tuan² and Dang Duc Long³
¹ The University of Da Nang, University of Science and Education, 459 Ton Duc Thang street, Lien Chieu district, Da Nang city, Vietnam
² The University of Da Nang, University of Science and Technology, 54 Nguyen Luong Bang street, Lien Chieu district, Da Nang city, Vietnam
³ The University of Da Nang, VN-UK Institute for Research & Executive Education, 158A Le Loi street, Hai Chau district, Da Nang city, Vietnam
ddbinh@ued.udn.vn, pmtuan@dut.udn.vn, long.dang@vnuk.edu.vn

ABSTRACT: Many methods can be used to predict the secondary structure of an RNA molecule. One of them is dynamic programming. However, dynamic programming is usually too time-consuming to be practical for long sequences. In this paper, we propose a novel RNA secondary structure prediction algorithm that combines a neural network model with a genetic algorithm to improve accuracy on large-scale RNA sequence and structure data. We analyze current experimental RNA sequence and structure data to construct a deep network model, and then extract implicit features for effective classification from the large-scale data to predict the pairing probability of each base in an RNA sequence. An enhanced genetic algorithm is then applied to the obtained base-pairing probabilities to find the optimal RNA secondary structure. Results indicate that the proposed method is superior to common RNA secondary structure prediction algorithms.
Based on the characteristics of deep learning algorithms, it can be inferred that the proposed method has a higher prediction success rate than other algorithms, an advantage that will matter more as the amount of real RNA structure data increases in the future.

Keywords: Neural Network; Genetic Algorithm; Machine Learning; RNA Secondary Structure; Base Pairing; Minimum Free Energy; Long Short-Term Memory.

I. INTRODUCTION

RNA molecules are integral components of the cellular machinery for protein synthesis and transport, transcriptional regulation, chromosome replication, RNA processing and modification, and other fundamental biological functions [1], [2]. The secondary structure of an RNA is represented by the list of nucleotide bases that are paired by hydrogen bonding within its sequence. Studying the relationship between RNA function and structure, and determining the form and frequency of RNA folding, are important for revealing the role of RNA molecules in the life process [3], [4]. Secondary structure can be determined directly by X-ray diffraction, but this is difficult, slow, and expensive. Moreover, it is currently impossible to crystallize most RNAs. Mathematical models for prediction have therefore been developed, and these have led to serial (and some parallel) computer algorithms, but these are also expensive in terms of computation time.

RNA is composed of four nucleotide bases: adenine (A), cytosine (C), guanine (G), and uracil (U). These are the same as in DNA except for uracil, in place of which DNA has thymine (T). Another structural difference is that DNA is double-stranded, whereas RNA is single-stranded in most cases. In an aqueous salt solution, RNA forms intra-strand base pairs, which result in the formation of the secondary structure. Under appropriate conditions, the secondary structure folds back around itself to form the tertiary structure of RNA.
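As an illustration of the list-of-pairs representation described above, a minimal Python sketch follows. The function names, the example sequence, the restriction to Watson-Crick pairs (A-U and G-C), and the dot-bracket output string are assumptions made for illustration only; they are not taken from the proposed method, and the dot-bracket conversion assumes a structure without crossing pairs.

```python
# Illustrative sketch only: all names and the example data are hypothetical.

# Assumption: only Watson-Crick pairs are allowed (wobble pairs are ignored).
CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def is_valid_structure(sequence, pairs):
    """Return True if `pairs` is a consistent base-pair list for `sequence`.

    `pairs` is a list of (i, j) tuples with 0-based positions and i < j.
    Each position may take part in at most one pair, and the two bases
    must form a canonical pair.
    """
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < len(sequence)):
            return False
        if i in used or j in used:
            return False
        if (sequence[i], sequence[j]) not in CANONICAL_PAIRS:
            return False
        used.update((i, j))
    return True


def to_dot_bracket(length, pairs):
    """Render a non-crossing base-pair list as a dot-bracket string."""
    chars = ["."] * length
    for i, j in pairs:
        chars[i], chars[j] = "(", ")"
    return "".join(chars)


# A short hairpin: three G-C pairs closing a four-base loop.
sequence = "GGGAAAUCCC"
pairs = [(0, 9), (1, 8), (2, 7)]
```

Here `to_dot_bracket(len(sequence), pairs)` writes paired positions as matching parentheses and unpaired positions as dots, which is one common way to display such a pair list in plain text.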
The folding process usually depends on the presence of divalent ions, such as magnesium ions, and on the temperature.

Much progress has been made in the computational prediction of RNA secondary structure. Dynamic programming is one of the oldest and most widely accepted techniques. This method of secondary structure prediction was first proposed by Waterman [5], Waterman and Smith [6], and Nussinov [7]. Its drawback is its computation time. The running time of these dynamic programming algorithms is $O(n^4)$, which is too slow to be effective for longer sequences because it grows with the fourth power of the sequence length. Several attempts to modify the dynamic programming algorithms have been made and are considered successful. Another way of determining the secondary structure of RNA is the comparative method, which works on more than one sequence simultaneously in order to find a structure common to them. Sankoff [8] extended the dynamic programming approach by folding and aligning multiple sequences to generate a phylogenetic tree for secondary structure prediction. The Zuker algorithm, implemented in the programs MFOLD [9] and ViennaRNA [10], is an efficient dynamic programming algorithm for identifying the globally minimal energy structure of a sequence, as defined by a thermodynamic model [11], [12]. The Zuker algorithm requires $O(n^3)$ time and $O(n^2)$ space for a sequence of length $n$. Unfortunately, these methods can predict, in acceptable time, only the structure of an RNA sequence with a length of no more than 200. Corpet and Michot designed a heuristic algorithm to identify which portions of two sequences can be aligned without structure information; the other portions are aligned using a specialized dynamic programming algorithm [13]. This method cannot predict structures with pseudoknots. Notredame et al. used a genetic algorithm (GA)