Estimating the Expression of Transcript Isoforms from mRNA-Seq via Nonnegative Least Squares

Hyunsoo Kim, Yingtao Bi, and Ramana V. Davuluri
Center for Systems and Computational Biology, The Wistar Institute
3601 Spruce Street, Philadelphia, PA 19104-4268
hkim@wistar.org, ybi@wistar.org, rdavuluri@wistar.org

Abstract: mRNA-Seq is an emerging technology, based on massively parallel sequencing, for the identification and quantification of gene transcripts. Although gene-level expression can easily be estimated from mRNA-Seq data, estimating isoform-level expression is considerably harder, because reads mapped to an exon may belong to more than one isoform, and dedicated methods are required. In this paper, we introduce a mathematical method to estimate transcript isoform concentrations, i.e., transcript isoform expression estimation via nonnegative least squares (TIEE/NLS).

Keywords: transcript isoform; mRNA-Seq; transcript isoform expression estimation; nonnegative least squares

I. INTRODUCTION

Gene-oriented analysis has been used in many biomedical studies, including cancer biology, developmental biology, and stem cell research. Recently, mRNA-Seq data from various tissues and cell lines have shown that it is common for a gene to have multiple isoforms due to alternative splicing. Alternative transcription initiation and/or termination can also generate different transcript isoforms. Since skipping exons may result in the loss of protein function, different transcript isoforms can give rise to different biological processes. Thus, transcript isoform-oriented study is desirable. For such a study, estimating the transcript isoform concentrations in samples is the first step. However, this is not an easy task with mRNA-Seq data, since the tag counts for a specific genomic location (exon) often correspond to several different transcript isoforms. A few recent publications have addressed this problem [1, 2].
In this paper, we introduce an alternative approach for gene expression estimation at the isoform level.

II. MRNA-SEQ

A. Gene-oriented mRNA-Seq Data Analyses

The mRNA-Seq method characterizes the expression profiles of genes and their isoforms at higher resolution than conventional microarray methods [3, 4]. For gene-oriented mRNA-Seq data analyses, the output is usually a gene-level expression value such as the number of reads per kilobase of exon per million mapped reads (RPKM) with respect to the transcriptome [5]. RPKM is simply given by:

$$ \mathrm{RPKM} = \frac{10^{9}\, C}{N L}, \tag{1} $$

where $C$ is the number of mappable reads that fell onto the gene's exons, $N$ is the total number of mappable reads in the experiment, and $L$ is the total length of the exons. Here, we propose using well-established optimization methods for problems with nonnegativity constraints. One of the simplest examples is nonnegative least squares.

III. TRANSCRIPT ISOFORM EXPRESSION ESTIMATION VIA NONNEGATIVE LEAST SQUARES (TIEE/NLS)

We consider known transcript isoform clusters. Each transcript isoform cluster contains the set of transcript isoforms of a gene. For more accurate estimation, we also include the transcript isoforms of other genes that overlap with them, so there can be more than one gene in the model. Since transcript isoforms can share only part of an exon, we use non-overlapping slices of exons. Suppose we have $m$ transcript isoforms and $n$ exon slices. After computing the RPKM of each exon slice, we solve a nonnegative least squares problem to estimate the contribution of each transcript isoform to the RPKMs of the observed exon reads. The observed RPKM of the j-th exon slice, i.e., $o_j$ for $1 \le j \le n$, can be represented as

$$ o_j = \sum_{i=1}^{m} x_i \cdot c_{ij} \quad \text{s.t.} \quad x_i \ge 0, \tag{2} $$

where $x_i$ is the estimated expression of the i-th transcript isoform, and $c_{ij}$ is 1 when the i-th transcript isoform has the j-th exon slice and 0 otherwise. The detailed algorithm is described below.

A.
Algorithm

Repeat the following steps for each transcript isoform cluster:

1. Get the genomic locations of the transcript isoforms in the cluster and compute the genomic range that covers them.
2. Find all transcript isoforms that overlap with this genomic range.
3. Collect the exons used in the transcript isoforms found. When exons overlap only partially, split them into mutually exclusive exon slices.
4. Build a matrix $C$ whose element $c_{ij}$ is 1 when the i-th transcript isoform has the j-th exon slice, and 0 otherwise.

This work was supported by RSG-07-097-01-MGO from the American Cancer Society.

2010 IEEE International Conference on Bioinformatics and Bioengineering. © 2010 IEEE. DOI 10.1109/BIBE.2010.61
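As an illustration of steps 3 and 4 together with Eqs. (1) and (2), the following is a minimal Python sketch. It is not the authors' implementation. The two isoforms, their exon coordinates, the read counts, and all function names are hypothetical. The text does not say which nonnegative least squares solver is used, so a simple cyclic coordinate-descent solver is swapped in here to keep the sketch free of external dependencies.

```python
# Hypothetical toy cluster: each isoform is a list of exons given as
# half-open (start, end) intervals. Isoform 1 uses only part of the
# second exon of isoform 0.
isoforms = [
    [(0, 100), (200, 300)],   # isoform 0
    [(0, 100), (250, 300)],   # isoform 1
]

def covers(exons, s, e):
    """True if some exon in `exons` fully contains the interval [s, e)."""
    return any(a <= s and e <= b for a, b in exons)

def exon_slices(isoforms):
    """Step 3: cut exons at every exon boundary into exclusive slices."""
    cuts = sorted({p for exons in isoforms for iv in exons for p in iv})
    return [(s, e) for s, e in zip(cuts, cuts[1:])
            if any(covers(exons, s, e) for exons in isoforms)]

def membership_matrix(isoforms, slices):
    """Step 4: c[i][j] = 1 if isoform i contains slice j, else 0."""
    return [[1 if covers(exons, s, e) else 0 for s, e in slices]
            for exons in isoforms]

def rpkm(count, total_reads, length):
    """Eq. (1): 10^9 * C / (N * L)."""
    return 1e9 * count / (total_reads * length)

def nnls_cd(c, o, sweeps=500):
    """Minimise sum_j (o_j - sum_i x_i c[i][j])^2 subject to x_i >= 0
    by cyclic coordinate descent (an assumed solver, see lead-in)."""
    m, n = len(c), len(o)
    x = [0.0] * m
    resid = list(o)                      # o_j - sum_i x_i c[i][j]
    for _ in range(sweeps):
        for i in range(m):
            norm = sum(v * v for v in c[i])
            if norm == 0:
                continue                 # isoform with no slices
            # Best unconstrained change of x_i, then clip at zero.
            delta = sum(c[i][j] * resid[j] for j in range(n)) / norm
            new = max(0.0, x[i] + delta)
            step = new - x[i]
            for j in range(n):
                resid[j] -= step * c[i][j]
            x[i] = new
    return x

slices = exon_slices(isoforms)
c = membership_matrix(isoforms, slices)

# Made-up read counts per slice and total mapped reads N.
counts = [10, 2, 5]
total_reads = 1_000_000
o = [rpkm(k, total_reads, e - s) for k, (s, e) in zip(counts, slices)]

x = nnls_cd(c, o)
print(slices)   # exon slices
print(c)        # binary isoform-by-slice matrix
print(x)        # estimated isoform expression, one value per isoform
```

In this toy case the slice RPKMs are exactly consistent with the model, so the estimate reproduces the slice values with zero residual. With real data the fit is only approximate. Where SciPy is available, `scipy.optimize.nnls` applied to the transpose of `c` would be a more standard choice of solver.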