Genomics
Original Article
Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction
Jael Sanyanda Wekesa a,b, Jun Meng a,⁎, Yushi Luan c
a School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China
b School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
c School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China
ARTICLE INFO
Keywords:
Deep learning
Secondary structure features
lncRNA–protein interaction
Prediction
Plants
ABSTRACT
Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse
molecular mechanisms, including binding to RNA-binding proteins. The majority of plant lncRNAs are
functionally uncharacterized; accurate prediction of plant lncRNA–protein interactions is therefore
imperative for subsequent functional studies. We present an integrative model, DRPLPI, whose distinguishing
feature is prediction by multi-feature fusion. It uses structural features together with four groups of
sequence features: tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We
design a multi-head self-attention long short-term memory encoder-decoder network to extract generative
high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a
single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained areas under the
precision-recall curve (AUPRC) of 0.9820 and 0.9652, respectively. The proposed method shows a significant
improvement in prediction performance compared with existing state-of-the-art methods.
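As an illustration of two of the sequence feature groups named in the abstract, the sketch below computes a tri-nucleotide composition vector and a binary (one-hot) profile for an RNA sequence. It is a minimal sketch under stated assumptions, not the DRPLPI implementation: the function names, the four-letter alphabet, the frequency normalisation, and the fixed-length zero padding are all hypothetical choices made for this example.

```python
from itertools import product

# Assumption: a four-letter RNA alphabet; DNA-style "T" is mapped to "U".
ALPHABET = "ACGU"


def trinucleotide_composition(seq):
    """Return the 64 normalised 3-mer frequencies of a sequence (hypothetical helper)."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=3)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width 3 over the sequence, one base at a time.
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    # Normalise counts to frequencies; an all-zero vector is returned for short input.
    return [counts[k] / total if total else 0.0 for k in kmers]


def binary_profile(seq, length):
    """One-hot encode the first `length` bases, zero-padding shorter sequences."""
    seq = seq.upper().replace("T", "U")
    profile = []
    for i in range(length):
        base = seq[i] if i < len(seq) else None
        # Each position contributes four bits, one per letter of ALPHABET.
        profile.extend(1 if base == b else 0 for b in ALPHABET)
    return profile
```

For example, `trinucleotide_composition("ACGUACGU")` yields a 64-dimensional vector whose entries sum to 1, and `binary_profile("AC", 3)` yields a 12-dimensional vector in which the third position is all zeros because of padding. Vectors of this kind could then be concatenated with other descriptors before being passed to a learner.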
1. Introduction
Non-coding RNAs (ncRNAs) are regulatory molecules involved in
diverse fundamental cellular processes in living organisms on a
genome-wide scale. According to the central dogma of molecular
biology, ncRNA transcripts lack conserved motifs and tissue specificity
[1]. Long ncRNAs (lncRNAs), which are longer than 200 nucleotides,
are a heterogeneous class of ncRNAs. lncRNAs function through
their interactions with RNA-binding protein (RBP) molecules.
Besides facilitating the functional mechanisms of their binding targets,
RBPs are also involved in the formation of ribonucleoprotein
complexes and the regulation of RNA fate from synthesis to decay [2]. Plant
lncRNAs are transcribed by RNA polymerase II as well as the plant-specific RNA
polymerases IV and V [3]. These lncRNAs help regulate plant
resistance to biotic and abiotic stresses, flowering, lateral root
development, and the modulation of RNA stability [4]. They also play
important roles in cellular processes such as messenger RNA (mRNA)
processing [5]. Recently, the rapid development of next-generation
sequencing technologies has brought forth an avalanche of sequence
data and transcriptome-wide insights into RNA-protein interaction.
Based on the abundant sequence data, computational tools provide a
more rapid and effective way of predicting RNA-protein interactions.
The interaction information is essential for annotating lncRNAs and for
understanding their molecular mechanisms and implication in diseases [6].
Although many lncRNAs have been discovered in plants by
experimental and computational techniques, their functions remain
elusive [7]. Therefore, predicting their RBP interaction partners and their
binding sites is critical for understanding their functions. Generally,
the computational determination of nucleotide-amino acid interaction
affinities depends on features curated from sequence and
structural information. Sequence-based feature descriptors use
sequence composition and evolutionary information [8], whereas
structure-based methods exploit shape and biophysical features [9]. Feature
encoding algorithms capable of capturing key characteristics of amino
acid residues and nucleotides contribute to improved predictive
accuracy. However, generating appropriate features for prediction is a
difficult task. Several feature engineering tools have been proposed,
including iLearn [10] and PyFeat [11]. The sequence-based features
commonly used by existing methods include the
position-specific weight matrix (PWM), autocovariance (AC) [12], k-mer [13]
and binary profile features (BPF) [14,15]. PWM indicates the
significance of each position of the amino acids present in the protein
sequence. AC is used to obtain the average correlation between a pair of
residues or nucleotides. k-mer generates sequence composition
https://doi.org/10.1016/j.ygeno.2020.05.005
Received 17 December 2019; Received in revised form 22 April 2020; Accepted 5 May 2020
⁎ Corresponding author.
E-mail address: mengjun@dlut.edu.cn (J. Meng).
Genomics 112 (2020) 2928–2936
Available online 11 May 2020
0888-7543/ © 2020 Elsevier Inc. All rights reserved.