Estimating the Expression of Transcript Isoforms from
mRNA-Seq via Nonnegative Least Squares
Hyunsoo Kim, Yingtao Bi, and Ramana V. Davuluri
Center for Systems and Computational Biology
The Wistar Institute
3601 Spruce Street, Philadelphia, PA 19104-4268
hkim@wistar.org, ybi@wistar.org, rdavuluri@wistar.org
Abstract—mRNA-Seq is an emerging technology, based on massively
parallel sequencing, for identification and quantification of gene
transcripts. Although gene level expression can easily be
estimated from mRNA-Seq data, estimating the isoform level
expression poses serious problems, and appropriate methods are
required. In this paper, we introduce a mathematical method to
estimate transcript isoform concentrations, i.e. transcript isoform
expression estimation via nonnegative least squares (TIEE/NLS).
Keywords: transcript isoform; mRNA-Seq;
transcript isoform expression estimation; nonnegative least
squares;
I. INTRODUCTION
Gene-oriented analysis has been used in many biomedical
studies, including cancer biology, developmental biology, and
stem cell research. Recently, mRNA-Seq data from various
tissues and cell lines have shown that it is common for a gene
to have multiple isoforms due to alternative splicing.
Alternative transcription initiation and/or termination can
also generate different transcript isoforms. Since skipping
exons may result in the loss of protein function, different
transcript isoforms can induce different biological processes.
Thus, transcript isoform-oriented study is desirable.
For transcript isoform-oriented study, estimating transcript
isoform concentrations in samples is the first step. However,
this is not an easy task with mRNA-Seq data, since the tag
counts for a specific genomic location (exon) may correspond
to several different transcript isoforms. A few recent
publications have addressed this problem [1, 2]. In this paper,
we introduce an alternative approach for gene expression
estimation at the isoform level.
II. MRNA-SEQ
A. Gene-oriented mRNA-Seq Data Analyses
The mRNA-Seq method characterizes the expression profile of
genes and their isoforms at higher resolution than the usual
microarray methods [3, 4]. For gene-oriented mRNA-Seq data
analyses, the output is usually the expression level of a gene,
such as reads per kilobase of exon per million reads mapped to
the transcriptome (RPKM) [5].
RPKM is simply given by:
$$\mathrm{RPKM} = \frac{10^{9}\, C}{N L}, \tag{1}$$
where $C$ is the number of mappable reads that fell onto the
gene's exons, $N$ is the total number of mappable reads in the
experiment, and $L$ is the total length of the exons (in base pairs).
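As a quick check of Eq. (1), the following minimal Python sketch computes RPKM from the three quantities defined above. The function name `rpkm` and the example numbers are illustrative assumptions, not values from this paper.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """RPKM = 1e9 * C / (N * L), following Eq. (1).

    read_count         -- C, mappable reads falling onto the exons
    total_mapped_reads -- N, mappable reads in the whole experiment
    exon_length_bp     -- L, total exon length in base pairs
    """
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * read_count / (total_mapped_reads * exon_length_bp)


# Hypothetical numbers: 500 reads on 2,000 bp of exons,
# in an experiment with 10 million mappable reads.
example = rpkm(500, 10_000_000, 2_000)
print(example)  # 25.0
```

The factor 10^9 combines the per-kilobase (10^3) and per-million-reads (10^6) normalizations.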
Here, we propose using well-established optimization methods
designed to handle nonnegativity constraints. One of the
simplest examples is nonnegative least squares.
III. TRANSCRIPT ISOFORM EXPRESSION ESTIMATION VIA
NONNEGATIVE LEAST SQUARES (TIEE/NLS)
We consider known transcript isoform clusters. Each
transcript isoform cluster contains the set of transcript
isoforms of a gene. For more accurate expression estimation, we
also consider transcript isoforms of other genes that overlap
them, so the model may contain more than one gene. Since
transcript isoforms can share only part of an exon, we use
nonoverlapping slices of exons. Suppose we have
$m$ transcript isoforms and $n$ exon slices. After getting RPKM
for each exon slice, we solve a nonnegative least squares
problem to estimate the transcript isoform contributions to
RPKMs of observed exon reads. The observed RPKM of the
$j$-th exon slice, i.e. $r_j$ for $1 \le j \le n$, can be represented as
$$r_j = \sum_{i=1}^{m} x_i \cdot c_{ij} \quad \text{s.t. } x_i \ge 0, \tag{2}$$
where $x_i$ is the estimated expression of the $i$-th transcript
isoform and $c_{ij}$ is 1 when the $i$-th transcript isoform has the
$j$-th exon slice, otherwise $c_{ij}$ is 0. The detailed algorithm is
described below.
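To make Eq. (2) concrete, here is a minimal Python sketch that fits nonnegative isoform expression values to exon-slice RPKMs. The solver is a simple projected coordinate descent chosen for this illustration so that the code needs only the standard library; the paper does not state which NLS solver it uses. The function name, the iteration count, and the toy data are all assumptions.

```python
def nnls_coordinate_descent(C, r, n_iter=500):
    """Minimize sum_j (r[j] - sum_i x[i] * C[i][j])**2 subject to x[i] >= 0.

    C is an m x n 0/1 list of lists (isoforms x exon slices);
    r holds the n observed exon-slice RPKMs.
    """
    m, n = len(C), len(r)
    x = [0.0] * m
    residual = [float(v) for v in r]  # r - C^T x, with x = 0 initially
    for _ in range(n_iter):
        for i in range(m):
            norm_sq = sum(C[i][j] ** 2 for j in range(n))
            if norm_sq == 0:
                continue  # isoform covers no slice; leave it at 0
            # Best value of x[i] with the other coordinates fixed, clipped at 0.
            grad = sum(C[i][j] * residual[j] for j in range(n))
            new_xi = max(0.0, x[i] + grad / norm_sq)
            delta = new_xi - x[i]
            if delta != 0.0:
                for j in range(n):
                    residual[j] -= delta * C[i][j]
                x[i] = new_xi
    return x


# Toy example (hypothetical): isoform 0 has all three slices,
# isoform 1 skips the middle slice.
C = [[1, 1, 1],
     [1, 0, 1]]
r = [10.0, 4.0, 10.0]  # consistent with x = [4, 6]
x = nnls_coordinate_descent(C, r)
print([round(v, 6) for v in x])  # approximately [4.0, 6.0]
```

In practice an established routine such as `scipy.optimize.nnls` could be applied to the transposed matrix instead; the loop above is only meant to show what the constrained fit computes.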
A. Algorithm
Repeat the following steps for each transcript isoform cluster:
1. Get the genomic locations of the transcript isoforms in the
transcript isoform cluster and compute the genomic
range that covers these transcript isoforms.
2. Find all transcript isoforms that overlap the
genomic range.
3. Collect the exons used in the transcript isoforms found. When
exons overlap only partially, split them into
mutually exclusive exon slices.
4. Build a matrix $C$ whose element $c_{ij}$ is 1 when the $i$-th
transcript isoform has the $j$-th exon slice, otherwise $c_{ij}$
is 0.
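Steps 3 and 4 can be sketched in Python as follows. This is an illustrative reading of those steps, not the authors' implementation: the function name `build_slice_matrix`, the half-open `(start, end)` coordinate convention, and the example isoforms are assumptions made for the sketch.

```python
def build_slice_matrix(isoforms):
    """Split exons into nonoverlapping slices and build the 0/1 matrix C.

    isoforms: dict mapping isoform name -> list of (start, end) exon
    intervals, half-open. Returns (slices, names, C), where C[i][j] is 1
    if isoform names[i] contains slice slices[j].
    """
    # Every exon boundary is a potential cut point.
    cuts = sorted({p for exons in isoforms.values() for s, e in exons for p in (s, e)})
    names = sorted(isoforms)
    slices, columns = [], []
    for lo, hi in zip(cuts, cuts[1:]):
        col = [int(any(s <= lo and hi <= e for s, e in isoforms[name]))
               for name in names]
        if any(col):  # drop segments that no isoform covers (introns, gaps)
            slices.append((lo, hi))
            columns.append(col)
    C = [[col[i] for col in columns] for i in range(len(names))]
    return slices, names, C


# Hypothetical cluster: tx2 uses a shorter version of the first exon.
isoforms = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(150, 200), (300, 400)],
}
slices, names, C = build_slice_matrix(isoforms)
print(slices)  # [(100, 150), (150, 200), (300, 400)]
print(C)       # [[1, 1, 1], [0, 1, 1]]
```

Here the partially shared first exon is cut at position 150, so the slice (100, 150) belongs to tx1 only while (150, 200) is shared by both isoforms.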
This work was supported by grant RSG-07-097-01-MGO from the
American Cancer Society.
2010 IEEE International Conference on Bioinformatics and Bioengineering
978-0-7695-4083-2/10 $26.00 © 2010 IEEE
DOI 10.1109/BIBE.2010.61