QLZCClust: Quaternary Lempel-Ziv Complexity based Clustering of the RNA-seq Read Block Segments

Ashis Kumer Biswas, Baoju Zhang, Xiaoyong Wu and Jean X. Gao

Abstract: The next-generation sequencing platform RNA-seq provides quantitative expression data that exhibit distinctive sequence patterns at the level of short-read segments, and these patterns have been found useful for clustering those segments. However, the resulting clusters do not reflect the functional chemistry of the non-coding RNAs (ncRNAs), whose functions are closely related to their secondary structures. Clustering the read block segments by their structural profiles rather than their sequence patterns is therefore essential and useful. We propose QLZCClust (Quaternary Lempel-Ziv Complexity based Clustering), an extension of the popular Lempel-Ziv algorithm that computes a pairwise secondary structure distance. We applied QLZCClust to the short-read segments obtained from an RNA-seq experiment and found that it can separate most miRNAs and tRNAs. Moreover, it can be used to detect structural similarities among different classes of ncRNAs. We compared our algorithm with clustering based on two other structural distance measures, the SimTree edit distance and an RNAz based distance, and found that our method performs better.

I. INTRODUCTION

RNA-seq is a revolutionary technology for profiling transcriptomes, with which the expression levels of transcripts and their isoforms can be measured very precisely [1]. The expression data open up a vast number of possibilities for examining the transcriptomic details of an organism, whether or not its genome is well studied. In most cases a reference genome is available, and the analysis pipeline starts by mapping the RNA-seq short-read sequences to the genome [2]. In this study we limit our focus to RNA-seq experiments that deal with microRNAs, a central topic in much therapeutic research.
MicroRNA-like small non-coding RNAs are produced from microRNA precursors, other structured RNAs and the dicers [3], [4]. The diversity of processing pathways prompts researchers to explore how the short-read patterns in RNA-seq datasets relate to the processing of particular non-coding RNAs. Examples include the mutual positioning of miR and miR* products with a 3’-overhang, which is a characteristic feature of dicer cleavage; the anomalous 5’-overhang observed for some microRNAs, which results from a distinct, dicer-dependent two-step mechanism; and the dicer-independent processing of mir-451 [5]. We therefore seek profiles that represent distinct pathways.

Ashis Kumer Biswas and Jean X. Gao are with the Department of Computer Science and Engineering, University of Texas at Arlington, Texas 76019, The United States of America (ashis.biswas@mavs.uta.edu; gao@uta.edu). Baoju Zhang and Xiaoyong Wu are with the School of Physics and Electronic Information, Tianjin Normal University, Tianjin, 300387, China.

Several studies address the identification of such profiles. deepBlockAlign [5] introduced a two-step approach to align RNA-seq read patterns with the aim of quickly identifying RNAs that share similar processing footprints. In the first stage, overlapping mapped reads are merged into blocks, and closely spaced blocks are then combined into block groups, each representing a locus of expression. In the second stage, block patterns are compared by means of a modified Sankoff algorithm that takes into account both block similarities and similarities in the pattern of distances within the block groups. Hierarchical clustering of the block groups separated most miRNAs and tRNAs, and also identified about a dozen tRNAs that cluster together with miRNAs. However, the method used only RNA-seq short-read sequence patterns in both stages and did not consider any similarity measure based on secondary structure.
Such a structural similarity measure is important, however, because most of the RNAs in the dataset are structured. We therefore introduce QLZCClust, a hierarchical clustering of the same block groups. Instead of computing a sequence based distance between segment pairs, we propose a new pairwise structural distance measure, the quaternary Lempel-Ziv complexity, which enhances the clustering results in the structural domain.

In the following section, we describe the proposed QLZCClust scheme in detail, along with the dataset preparation step. In Section III we summarize our experimental results and compare our performance with clustering results obtained using different secondary structure distance metrics. Finally, in Section IV we conclude the manuscript with a guide to future research directions.

II. METHODS

A. Dataset Preparation

We worked with the same benchmark Illumina sequencing dataset used by the authors of “deepBlockAlign” [5], namely the Human eb dataset [6]. First, the short-read sequences from the RNA-seq experiment were mapped onto the reference genome using “segemehl” [7]. The mapped reads were then divided into blocks of consecutive reads using the “blockbuster” tool [8] (with parameters: distance=30, minBlockHeight=1, minClusterHeight=50, scale=0.5). The expression filter [5] was applied to the read block groups (segments) to keep only those segments having more than one block, at least 50 reads per group, and a size between 50 nt and 200 nt. In the end, 455 RNA read block segments remained.

978-1-4799-3163-7/13/$31.00 ©2013 IEEE
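The expression filter described in the Dataset Preparation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` and `BlockGroup` classes, their field names, and the function `passes_expression_filter` are hypothetical and do not mirror the output format of “blockbuster”. The sketch also assumes that the reads of a group are the sum of the reads of its blocks, that the segment size is the genomic span covered by its blocks, and that the 50 nt and 200 nt bounds are inclusive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One block of overlapping reads (hypothetical representation)."""
    start: int  # genomic start, 0-based inclusive
    end: int    # genomic end, exclusive
    reads: int  # number of reads merged into this block


@dataclass
class BlockGroup:
    """A read block group (segment): closely spaced blocks at one locus."""
    name: str
    blocks: List[Block]

    @property
    def total_reads(self) -> int:
        # Assumption: reads per group = sum of reads over its blocks.
        return sum(b.reads for b in self.blocks)

    @property
    def length(self) -> int:
        # Assumption: segment size = span from first block start to last block end.
        return max(b.end for b in self.blocks) - min(b.start for b in self.blocks)


def passes_expression_filter(group: BlockGroup,
                             min_reads: int = 50,
                             min_len: int = 50,
                             max_len: int = 200) -> bool:
    """Keep segments with more than one block, at least `min_reads` reads,
    and a size between `min_len` and `max_len` nt (bounds assumed inclusive)."""
    return (len(group.blocks) > 1
            and group.total_reads >= min_reads
            and min_len <= group.length <= max_len)


# Toy data, invented purely for illustration.
groups = [
    BlockGroup("kept", [Block(100, 122, 40), Block(130, 152, 30)]),
    BlockGroup("single_block", [Block(0, 80, 500)]),
    BlockGroup("too_few_reads", [Block(0, 22, 20), Block(30, 52, 10)]),
    BlockGroup("too_long", [Block(0, 22, 60), Block(300, 322, 60)]),
]
kept = [g.name for g in groups if passes_expression_filter(g)]
```

In this toy example only the first group satisfies all three criteria; the others fail on block count, read count, and size respectively.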