Genomics
Journal homepage: www.elsevier.com/locate/ygeno

Original Article

Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction

Jael Sanyanda Wekesa a,b, Jun Meng a,⁎, Yushi Luan c

a School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China
b School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
c School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China

ARTICLE INFO

Keywords: Deep learning; Secondary structure features; lncRNA–protein interaction; Prediction; Plants

ABSTRACT

Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse molecular mechanisms, including binding to RNA binding proteins. The majority of plant lncRNAs are functionally uncharacterized; thus, accurate prediction of plant lncRNA–protein interaction is imperative for subsequent functional studies. We present an integrative model, namely DRPLPI, whose distinguishing characteristic is that it predicts by multi-feature fusion. Structural features and four groups of sequence features are used, the latter comprising tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We design a multi-head self-attention long short-term memory encoder-decoder network to extract generative high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained an area under the precision/recall curve (AUPRC) of 0.9820 and 0.9652, respectively. The proposed method shows significant enhancement in prediction performance compared with existing state-of-the-art methods.

1. Introduction

Non-coding RNAs (ncRNAs) are regulatory molecules involved in diverse fundamental cellular processes in living organisms on a genome-wide scale.
According to the central dogma of molecular biology, ncRNA transcripts lack conserved motifs and tissue specificity [1]. Long ncRNAs (lncRNAs), which are longer than 200 nucleotides, are a heterogeneous class of ncRNAs. lncRNAs function by harnessing their interactions with RNA binding protein (RBP) molecules. Besides facilitating functional mechanisms of their binding targets, RBPs are also involved in the formation of ribonucleoprotein complexes and the regulation of RNA fate from synthesis to decay [2]. Plant lncRNAs are transcribed by RNA polymerase II as well as the plant-specific RNA polymerases IV and V [3]. These lncRNAs help in regulating plant resistance to biotic and abiotic stresses, flowering, lateral root development, and in the modulation of RNA stability [4]. They also play important roles in cellular processes such as messenger RNA (mRNA) processing [5].

Recently, the rapid development of next-generation sequencing technologies has produced an avalanche of sequence data and transcriptome-wide insights into RNA-protein interaction. Based on the abundant sequence data, computational tools provide a more rapid and effective way of predicting RNA-protein interactions. The interaction information is essential for the annotation of lncRNAs and for understanding their molecular mechanisms and implications in diseases [6].

Although many lncRNAs have been discovered in plants by experimental and computational techniques, their functions remain elusive [7]. Therefore, predicting their RBP interaction partners and binding sites is critical for understanding their functions. Generally, the computational determination of nucleotide-amino acid interaction affinities depends on features curated from sequence and structural information. Sequence-based feature descriptors use sequence composition and evolutionary information [8], whereas structure-based methods exploit shape and biophysical features [9].
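As an illustration of what a sequence composition descriptor looks like in practice, the following is a minimal sketch of a normalized k-mer frequency vector, with k = 3 giving the tri-nucleotide composition named in the abstract. It is not the authors' implementation: the function name `kmer_composition`, the `ACGU` alphabet, the mapping of T to U, and the choice to skip windows containing ambiguous symbols are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies of seq (hypothetical helper).

    The output has len(alphabet) ** k entries, ordered as produced by
    itertools.product over the alphabet (AAA, AAC, AAG, ...).
    """
    # Assumption: DNA-style input is converted to the RNA alphabet.
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows with symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


features = kmer_composition("AUGCAUGC")
print(len(features))  # number of tri-nucleotide features
```

Each sequence is mapped to a fixed-length vector regardless of its length, which is what allows composition features to be concatenated with other descriptors before being passed to a learner.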
Feature encoding algorithms capable of capturing key characteristics of amino acid residues and nucleotides contribute to improved predictive accuracy. However, generating appropriate features for prediction is a difficult task. Some feature engineering tools have been proposed, including iLearn [10] and PyFeat [11]. The sequence-based features commonly used by existing methods include position-specific weight matrix (PWM), autocovariance (AC) [12], k-mer [13] and binary profile features (BPF) [14,15]. PWM indicates the significance of each position of the amino acids present in the protein sequence. AC is used to obtain the average correlation between a pair of residues or nucleotides.

⁎ Corresponding author. E-mail address: mengjun@dlut.edu.cn (J. Meng).
https://doi.org/10.1016/j.ygeno.2020.05.005
Received 17 December 2019; Received in revised form 22 April 2020; Accepted 5 May 2020; Available online 11 May 2020
Genomics 112 (2020) 2928–2936
0888-7543/© 2020 Elsevier Inc. All rights reserved.

k-mer generates sequence composition