BIOINFORMATICS
Vol. 20 Suppl. 1 2004, pages i31–i39
DOI: 10.1093/bioinformatics/bth924
Statistical modeling of sequencing errors in SAGE libraries
Tim Beißbarth¹,*, Lavinia Hyde¹, Gordon K. Smyth¹, Chris Job², Wee-Ming Boon², Seong-Seng Tan², Hamish S. Scott¹ and Terence P. Speed¹
¹Walter and Eliza Hall Institute of Medical Research, Genetics and Bioinformatics, 1G Royal Parade, Parkville, Vic 3050, Australia and
²Howard Florey Institute, Brain Development Laboratory, University of Melbourne, Parkville, Vic 3010, Australia
Received on January 15, 2004; accepted on March 1, 2004
ABSTRACT
Motivation: Sequencing errors may bias the gene expression
measurements made by Serial Analysis of Gene Expression
(SAGE). They may introduce non-existent tags at low abund-
ance and decrease the real abundance of other tags. These
effects are increased in the longer tags generated in Long-
SAGE libraries. Current sequencing technology generates
quite accurate estimates of sequencing error rates. Here we
make use of the sequence neighborhood of SAGE tags and
error estimates from the base-calling software to correct for
such errors.
Results: We introduce a statistical model for the propagation
of sequencing errors in SAGE and suggest an Expectation-
Maximization (EM) algorithm to correct for them given
observed sequences in a library and base-calling error estim-
ates. We tested our method using simulated and experimental
SAGE libraries. When comparing SAGE libraries, we found
that sequencing errors can introduce considerable bias. High
abundance tags may be falsely called as significantly differ-
entially expressed, especially when comparing libraries with
different levels of sequencing errors and/or of different size.
Truly differentially expressed tags have decreased significance
because ‘true’-tag counts are generally underestimated. This
may change whether tags near the threshold of differential
expression are called significant. Moreover, the number of different
transcripts present in a library is overestimated as false tags
are introduced at low abundance. Our correction method
adjusts the tag counts to be closer to the true counts and
is able to partly correct for biases introduced by sequencing
errors.
Availability: An implementation using R is distributed
as an R package. An online version is available at
http://tagcalling.mbgproject.org
Contact: beissbarth@wehi.edu.au
*To whom correspondence should be addressed.
1 INTRODUCTION
Serial Analysis of Gene Expression (SAGE) is a gene
expression profiling technique that estimates the abundance
of thousands of gene transcripts (mRNAs) from a cell or tissue
sample in parallel (Velculescu et al., 1995). SAGE is based
on the sequencing of short sequence tags that are extracted
at defined positions of the transcript. As opposed to DNA
microarray technology (Schena et al., 1995; Lockhart et al.,
1996), SAGE does not require prior knowledge of the tran-
scripts, and results in an estimate of the absolute abundance
of a transcript. However, due to sequencing errors, a proportion
of the low-abundance tags do not represent real genes,
which impairs the ability of SAGE to estimate the number of
transcripts that have been observed. Moreover, loss of ‘true’-tags
due to sequencing errors will result in altered numbers for the
abundance of genuine transcripts.
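The size of both effects can be gauged with a rough calculation. The sketch below assumes a constant per-base error rate and independent errors across bases, which is a simplification and not the statistical model developed in this paper. The 1% error rate, the tag lengths of 10 and 17 bases, and the function names are illustrative assumptions.

```python
def prob_tag_has_error(per_base_error: float, tag_length: int) -> float:
    """Probability that at least one base of a tag is miscalled.

    Assumes independent errors with the same rate at every base
    (an illustrative simplification).
    """
    return 1.0 - (1.0 - per_base_error) ** tag_length


def expected_correct_count(true_count: float, per_base_error: float,
                           tag_length: int) -> float:
    """Expected number of copies of a tag that are read without any error."""
    return true_count * (1.0 - per_base_error) ** tag_length


if __name__ == "__main__":
    per_base_error = 0.01  # hypothetical error rate, chosen for illustration
    for tag_length in (10, 17):  # assumed short and long tag lengths
        p_err = prob_tag_has_error(per_base_error, tag_length)
        kept = expected_correct_count(1000, per_base_error, tag_length)
        print(f"length {tag_length}: P(error) = {p_err:.3f}, "
              f"expected correct copies out of 1000 = {kept:.1f}")
```

Under these assumptions the fraction of erroneous tags grows with tag length. Each erroneous copy is lost from the count of its true tag and typically appears as a different, low-abundance tag.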
Stollberg et al. (2000) have studied the effects of various
sources of errors on SAGE results by simulating libraries.
Previously, the impact of sequencing errors has been reduced by removing
low-abundance tags or tags with low sequence quality from
the libraries (Margulies and Innis, 2000). Velculescu et al.
(1999) first attempted to join low-abundance tags to their
neighborhood. A more refined approach, which assumes constant
error probabilities and uses matrix inversion to correct for
sequencing errors, was introduced by Colinge and Feger
(2001). Another recently developed approach by Blades et al.
(2004a,b) uses a linear relation between the copy number of
observed tags and the number of neighbors with one-base sub-
stitutions to estimate the average rate of sequencing errors and
eliminate unreliable tags. Akmaev and Wang (2004) use such
error estimates to correct for sequencing errors and PCR-based
artifacts. They estimate that in LongSAGE libraries 3.5% of
the tag sequences have errors resulting from PCR artifacts
and 17.3% of the tag sequences have errors resulting from
sequencing errors. These approaches, however, do not take
into account estimates for sequencing errors of the individual
bases.
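To illustrate what per-base error estimates add over a constant error rate, the sketch below converts base-calling quality scores into error probabilities and enumerates the one-substitution neighborhood of a tag. It assumes that the base caller reports Phred-scaled quality scores (error probability 10^(-Q/10)), which this section does not state. The function names and example values are hypothetical, and this is not the authors' implementation.

```python
from math import prod


def phred_to_error_prob(quality: float) -> float:
    """Convert a Phred-scaled quality score Q to an error probability."""
    return 10 ** (-quality / 10)


def prob_tag_error_free(qualities) -> float:
    """Probability that every base of a tag was called correctly.

    Assumes errors at different bases are independent.
    """
    return prod(1.0 - phred_to_error_prob(q) for q in qualities)


def one_substitution_neighbors(tag: str, alphabet: str = "ACGT") -> list:
    """All sequences that differ from `tag` by exactly one base substitution."""
    neighbors = []
    for i, base in enumerate(tag):
        for alt in alphabet:
            if alt != base:
                neighbors.append(tag[:i] + alt + tag[i + 1:])
    return neighbors


if __name__ == "__main__":
    tag = "ACGTACGTAC"                      # hypothetical 10-base tag
    uniform = [20] * len(tag)               # Q20 at every base
    one_weak_base = [40] * 9 + [10]         # high quality except the last base
    print(prob_tag_error_free(uniform))
    print(prob_tag_error_free(one_weak_base))
    print(len(one_substitution_neighbors(tag)))
```

Two reads of the same tag can carry very different error probabilities depending on their quality scores, and a single low-quality base indicates where a substitution is most likely. A constant error rate cannot capture either point.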
Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved. i31