QLZCClust: Quaternary Lempel-Ziv Complexity based Clustering of
the RNA-seq Read Block Segments
Ashis Kumer Biswas, Baoju Zhang, Xiaoyong Wu and Jean X. Gao
Abstract— The Next Generation Sequencing platform, RNA-
seq provides quantitative expression data that exhibit distinctive
sequence patterns at the level of short-read segments, and these
patterns have proved useful for clustering those segments.
However, the resulting clusters do not reflect the functional
chemistry of the non-coding RNAs (ncRNAs), whose functions are
closely tied to their secondary structures. Clustering the read
block segments by their structural profiles rather than their
sequence patterns is therefore both essential and useful. We
propose QLZCClust (Quaternary Lempel-Ziv Complexity based
Clustering), a method that extends the popular Lempel-Ziv
algorithm to compute pairwise secondary structure distances.
We applied QLZCClust to the short-read segments obtained from
an RNA-seq experiment and found that it can separate most
miRNAs and tRNAs. Moreover, it can be used to detect
structural similarities among different classes of ncRNAs. We
compared our algorithm with clustering based on two other
structural distance measures, the SimTree edit distance and an
RNAz-based distance, and found that our method performs better.
I. INTRODUCTION
RNA-seq is a revolutionary technology for profiling
transcriptomes that enables very precise measurement of the
expression levels of transcripts and their isoforms [1]. The
expression data open up many ways to examine the transcriptomic
details of an organism, whether or not its genome is well
studied. In most cases a reference genome is available, and the
analysis pipeline starts by mapping the RNA-seq short-read
sequences to that genome [2]. In this study we focus on RNA-seq
experiments that deal with microRNAs, a central topic in much
therapeutic research. MicroRNA-like small non-coding RNAs are
produced from microRNA precursors and from other structured
RNAs through dicer processing [3], [4]. The diversity of
processing pathways motivates researchers to explore how the
short-read patterns in RNA-seq datasets relate to the processing
of particular non-coding RNAs. Examples include the mutual
positioning of the miR and miR* products with a 3’-overhang,
which is characteristic of dicer cleavage; the anomalous
5’-overhang observed for some microRNAs, which results from a
distinct, dicer-dependent two-step mechanism; and the
dicer-independent processing of mir-451 [5]. We therefore seek
profiles that represent distinct processing pathways.
Ashis Kumer Biswas and Jean X. Gao are with the Department of Computer
Science and Engineering, University of Texas at Arlington, Texas 76019,
The United States of America (e-mail: ashis.biswas@mavs.uta.edu;
gao@uta.edu)
Baoju Zhang and Xiaoyong Wu are with School of Physics and Electronic
Information, Tianjin Normal University, Tianjin, 300387, China
Several studies have addressed the identification of such
profiles. deepBlockAlign [5] introduced a two-stage
approach to align RNA-seq read patterns with the aim of
quickly identifying RNAs that share similar processing
footprints. In the first stage, overlapping mapped reads are
merged into blocks, and closely spaced blocks are then combined
into block groups, each representing a locus of expression. In
the second stage, block patterns are compared by means of a
modified Sankoff algorithm that takes into account both block
similarities and the similarity of the patterns of distances
within the block groups. Hierarchical clustering of the block
groups separated most miRNAs and tRNAs, and also identified
about a dozen tRNAs that cluster together with miRNAs. However,
the method uses only RNA-seq short-read sequence patterns in
both stages and does not consider any similarity measure based
on secondary structure. Structural similarity is important
because most of the RNAs in the dataset are structured. We
therefore introduce QLZCClust, a hierarchical clustering of the
same block groups that replaces the sequence-based distance
between segment pairs with a new pairwise structural distance
measure, the quaternary Lempel-Ziv complexity, which can improve
the clustering results in the structural domain.
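The block and block-group construction described for deepBlockAlign can be illustrated with a minimal sketch. It assumes that reads are given as half-open (start, end) intervals on a single chromosome and strand, and that "closely spaced" means a gap of at most max_gap nucleotides. The function names and this simple interval-merging rule are hypothetical illustrations, not the actual deepBlockAlign or blockbuster implementation.

```python
from typing import List, Tuple

# Hypothetical representation: half-open (start, end) on one chromosome/strand.
Interval = Tuple[int, int]


def merge_overlapping(reads: List[Interval]) -> List[Interval]:
    """Merge reads that overlap each other into blocks."""
    blocks: List[Interval] = []
    for start, end in sorted(reads):
        if blocks and start < blocks[-1][1]:
            # The read overlaps the current block: extend the block.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    return blocks


def group_blocks(blocks: List[Interval], max_gap: int) -> List[List[Interval]]:
    """Combine non-overlapping blocks separated by at most max_gap nt."""
    groups: List[List[Interval]] = []
    for block in sorted(blocks):
        if groups and block[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(block)
        else:
            groups.append([block])
    return groups


reads = [(0, 20), (10, 30), (40, 60), (200, 220)]
blocks = merge_overlapping(reads)
block_groups = group_blocks(blocks, max_gap=30)
```

Each inner list of block_groups then stands for one locus of expression in the sense used above.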
In the following section, we describe the proposed
QLZCClust scheme in detail, along with the dataset preparation
step. In Section III we summarize our experimental results and
compare our performance with the clustering results obtained
using different secondary structure distance metrics. Finally,
in Section IV we conclude the manuscript and outline directions
for future research.
II. METHODS
A. Dataset Preparation
We worked with the same benchmark Illumina sequencing
dataset used by the authors of “deepBlockAlign” [5],
namely the Human eb dataset [6]. First, the
short-read sequences from the RNA-seq experiment were
mapped onto the reference genome using “segemehl”
[7]. The mapped reads were then divided into blocks
of consecutive reads using the “blockbuster” tool [8]
(with parameters: distance=30, minBlockHeight=1,
minClusterHeight=50, scale=0.5). The expression
filter [5] was applied to the read block groups (segments)
to keep only those segments with more than one block,
at least 50 reads per group, and a size between 50
nt and 200 nt. In the end, 455 RNA read block segments
remained.
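The expression filter can be sketched as follows. The BlockGroup record and its field names are hypothetical, and treating the 50 nt and 200 nt bounds as inclusive is an assumption. Only the thresholds themselves are taken from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockGroup:
    """Hypothetical record for one read block group (segment)."""
    start: int      # genomic start of the segment
    end: int        # genomic end of the segment (exclusive)
    n_blocks: int   # number of read blocks in the group
    n_reads: float  # number of reads supporting the group

    @property
    def length(self) -> int:
        return self.end - self.start


def passes_expression_filter(group: BlockGroup,
                             min_reads: float = 50,
                             min_len: int = 50,
                             max_len: int = 200) -> bool:
    """Keep groups with more than one block, enough reads, and 50-200 nt size."""
    return (group.n_blocks > 1
            and group.n_reads >= min_reads
            and min_len <= group.length <= max_len)


def apply_expression_filter(groups: List[BlockGroup]) -> List[BlockGroup]:
    return [g for g in groups if passes_expression_filter(g)]


candidates = [
    BlockGroup(start=100, end=190, n_blocks=2, n_reads=120),  # kept
    BlockGroup(start=100, end=190, n_blocks=1, n_reads=120),  # single block
    BlockGroup(start=100, end=190, n_blocks=3, n_reads=10),   # too few reads
    BlockGroup(start=100, end=400, n_blocks=3, n_reads=500),  # too long
]
kept = apply_expression_filter(candidates)
```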
978-1-4799-3163-7/13/$31.00 ©2013 IEEE