Volume 8 • Issue 1 • 1000332 J Biom Biostat, an open access journal
ISSN: 2155-6180
Li et al., J Biom Biostat 2017, 8:1
DOI: 10.4172/2155-6180.1000332
Research Article Open Access
Journal of Biometrics & Biostatistics
Keywords: FDR; Sample size; Wald test; Exact test; Likelihood ratio test
Introduction
Sample size calculation is a prerequisite of experimental design
in biological research and clinical trials. Recently, high-throughput
RNA sequencing (RNA-seq) technology has been widely
used for gene expression studies in a variety of applications, such
as expression profiling of mRNAs or non-coding RNAs [1-3], de
novo assembly and characterization of transcriptomes [4,5], and the
identification of novel alternatively spliced transcripts [6,7]. These
novel transcripts or differentially expressed genes (DEGs) identified
from RNA-seq data may serve as human disease biomarkers or gene
signatures for clinical diagnosis [8-10]. With the rapid growth of
RNA-seq applications, sample size calculation methods derived from
test statistics with an appropriate distribution are an important issue
to be explored and discussed.
Due to the initial high cost of RNA-seq, sample size, in terms of the
number of biological replicates, was not seriously considered as part of
the experimental design. As a case in point, one RNA-seq review article
[11] documented that many RNA-seq studies had only one or a few
biological replicates. While thousands of DEGs were
identified within these studies, the lack of biological replicates leads
to an absence of knowledge concerning biological variation and may
result in a high percentage of false positive genes. Therefore, ignorance
of biological variation is the fundamental problem with
results obtained from un-replicated data. A recent paper [12] was the
first to point out that conclusions drawn from un-replicated samples
can be misleading and unrealistic. Later, another study [13] further
addressed design and validation issues due to the lack of biological
replicates in RNA-seq data.
One of the key questions in experimental design is how to determine
the number of biological replicates needed for differential expression
analysis in order to achieve a desired statistical power given a
significance level α and an underlying distribution. Since RNA-seq
data are read counts, a Poisson distribution is commonly used as the
model for identifying DEGs in RNA-seq data [14,15]. Fang and Cui
were the first to derive sample size calculations based on a Wald
test with a Poisson distribution for a single gene in RNA-seq.
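To make the idea of a Wald-based sample size under a Poisson model concrete, the following is a minimal sketch and not the formula of Fang and Cui, which is not reproduced in this section. It assumes a two-sided Wald test on the log fold change with equal group sizes and the variance evaluated under the alternative, using the delta-method variance (1/μ0 + 1/μ1)/n for the estimated log ratio. The function name and parameters (`mu0`, `rho`) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def poisson_wald_sample_size(mu0, rho, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided Wald test of log(mu1/mu0) = 0.

    Assumptions (illustrative, not taken from the article):
      - counts in the two groups are Poisson with means mu0 and mu1 = rho * mu0
      - both groups have the same number of replicates n
      - Var(log(mean1_hat / mean0_hat)) is approximated by (1/mu0 + 1/mu1) / n
    """
    if mu0 <= 0 or rho <= 0 or rho == 1:
        raise ValueError("mu0 and rho must be positive, and rho must differ from 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    mu1 = rho * mu0
    unit_variance = 1 / mu0 + 1 / mu1              # variance of the log ratio for n = 1
    n = (z_alpha + z_beta) ** 2 * unit_variance / math.log(rho) ** 2
    return math.ceil(n)


# Example: mean count 5 in the control group, two-fold change, alpha 0.05, 80% power
print(poisson_wald_sample_size(mu0=5, rho=2))
```

Under these assumptions, the required number of replicates grows as the mean count falls or as the fold change approaches 1, which is the qualitative behavior a sample size formula of this kind is expected to capture.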
*Corresponding author: Rai SN, Department of Bioinformatics and Biostatistics,
University of Louisville, Louisville, KY, 40202, USA, Tel: 1-502-852-4030; E-mail:
Shesh.Rai@louisville.edu
Received December 21, 2016; Accepted January 25, 2017; Published January
30, 2017
Citation: Li X, Cooper NGF, Shyr Y, Wu D, Rouchka EC, et al. (2017) Inference
and Sample Size Calculations Based on Statistical Tests in a Negative Binomial
Distribution for Differential Gene Expression in RNA-seq Data. J Biom Biostat 8:
332. doi:10.4172/2155-6180.1000332
Copyright: © 2017 Li X, et al. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and
source are credited.
Inference and Sample Size Calculations Based on Statistical Tests in a
Negative Binomial Distribution for Differential Gene Expression in
RNA-seq Data
Xiaohong Li¹,², Nigel GF Cooper², Yu Shyr³, Dongfeng Wu¹, Eric C Rouchka⁴, Ryan S Gill⁵, Timothy E O’Toole⁶, Guy N Brock¹,⁷ and Shesh N Rai¹,*
¹ Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
² Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, 40202, USA
³ Department of Biostatistics, Vanderbilt University, Nashville, TN, 37232, USA
⁴ Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, 40292, USA
⁵ Department of Mathematics, University of Louisville, Louisville, KY, 40292, USA
⁶ Department of Cardiology, University of Louisville, Louisville, KY, 40202, USA
⁷ Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA
Abstract
High-throughput RNA sequencing (RNA-seq) technology has become the method of choice for
transcriptomics and the detection of differentially expressed genes. Sample size calculations for RNA-seq experimental
design are an important consideration in biological research and clinical trials. Sample size formulas
derived from the Wald and the likelihood ratio statistical tests with a Poisson distribution to model RNA-seq data have
already been developed. However, since the mean read counts in real RNA-seq data are not equal to the variance, an
extended method to calculate sample sizes based on a negative binomial distribution using an exact test statistic
was proposed by Li et al. in 2013. In this study, we derive five alternative sample size calculation methods based on
the negative binomial distribution using the Wald test, the log-transformed Wald test and the log-likelihood ratio test
statistics. We compared our five methods and the existing method by calculating the sample sizes and
the simulated power in different scenarios. We first calculated the sample sizes for testing a single gene using the six
methods given a nominal significance level α of 0.05 and 80% power. Then, we calculated the sample sizes for testing
multiple genes given a false discovery rate (FDR) of 0.05 and 0.10. The empirical power and true prognostic genes
for differential gene expression analysis corresponding to the estimated sample sizes from the six methods are also
estimated via simulation studies. Using the sample size formulas derived from log-transformed and Wald-based
tests, we observed smaller sample sizes while maintaining the nominal power close to or higher than 80% in all
the settings compared to other methods. Moreover, the Wald test based sample size calculation method is easier to
compute and faster in an RNA-seq experimental design.
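As a rough illustration of how a negative binomial model and FDR control change the required sample size, the sketch below extends a log-scale Wald calculation with a dispersion term. It is not one of the five formulas derived in this article, which are not reproduced in this section. It assumes a variance of μ + φμ² for the read counts, a common dispersion φ in both groups, equal group sizes, and the delta-method variance (1/μ0 + 1/μ1 + 2φ)/n for the estimated log fold change. For multiple genes it assumes the commonly used relation α* = r1·f / (m0·(1 − f)) between a target FDR f, the number of true null genes m0, and the expected number of true discoveries r1. All function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def fdr_adjusted_alpha(fdr, m0, r1):
    """Per-gene significance level alpha* = r1 * f / (m0 * (1 - f)).

    Assumed relation (not taken from this section): with m0 true null genes and
    r1 expected true discoveries, FDR is approximated by m0*alpha / (m0*alpha + r1).
    """
    return r1 * fdr / (m0 * (1 - fdr))


def nb_log_wald_sample_size(mu0, rho, phi, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided log-scale Wald test under a
    negative binomial model with Var(Y) = mu + phi * mu**2 (illustrative sketch)."""
    if mu0 <= 0 or rho <= 0 or rho == 1 or phi < 0:
        raise ValueError("need mu0 > 0, rho > 0, rho != 1 and phi >= 0")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    mu1 = rho * mu0
    # Delta method: Var(log mean_hat) is about (1/mu + phi) / n in each group
    unit_variance = 1 / mu0 + 1 / mu1 + 2 * phi
    n = (z_alpha + z_beta) ** 2 * unit_variance / math.log(rho) ** 2
    return math.ceil(n)


# Single gene: alpha = 0.05, 80% power, mean count 5, two-fold change, dispersion 0.1
n_single = nb_log_wald_sample_size(mu0=5, rho=2, phi=0.1)

# Multiple genes: FDR 0.05 with 9900 assumed null genes and 80 expected true discoveries
alpha_star = fdr_adjusted_alpha(fdr=0.05, m0=9900, r1=80)
n_multi = nb_log_wald_sample_size(mu0=5, rho=2, phi=0.1, alpha=alpha_star)

print(n_single, alpha_star, n_multi)
```

Setting `phi=0` reduces the calculation to the Poisson case, while any positive dispersion or a stricter per-gene α increases the number of replicates required.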