Journal of Biometrics & Biostatistics (J Biom Biostat), an open access journal, ISSN: 2155-6180
Volume 8, Issue 1, 1000332
Research Article, Open Access
Li et al., J Biom Biostat 2017, 8:1
DOI: 10.4172/2155-6180.1000332

Keywords: FDR; Sample size; Wald test; Exact test; Likelihood ratio test

Introduction

Sample size calculations are a prerequisite in experimental design for biological research and clinical trials. Recently, high-throughput RNA sequencing (RNA-seq) technology has been widely used for gene expression studies in a variety of applications, such as expression profiling of mRNAs or non-coding RNAs [1-3], de novo assembly and characterization of transcriptomes [4,5], and the identification of novel alternatively spliced transcripts [6,7]. These novel transcripts or differentially expressed genes (DEGs) identified from RNA-seq data may serve as human disease biomarkers or gene signatures for clinical diagnosis [8-10]. With the rapid growth of RNA-seq applications, sample size calculation methods derived from test statistics with an appropriate distribution are important issues to be explored and discussed.

Due to the initial high cost of RNA-seq, sample size, in terms of the number of biological replicates, was not seriously considered as part of the experimental design. As a case in point, one RNA-seq review article [11] documented several RNA-seq studies, many of which had only one or a few biological replicates. While thousands of DEGs were identified within these studies, the lack of biological replicates leads to an absence of knowledge concerning biological variation and may result in a high percentage of false positive genes. Therefore, ignorance of biological variation is the fundamental problem with results obtained from un-replicated data.
A recent paper [12] was the first to point out that conclusions drawn from un-replicated samples can be misleading and unrealistic. Later, another study [13] further addressed design and validation issues due to the lack of biological replicates in RNA-seq data.

One of the key questions in experimental design is to determine the number of biological replicates needed for differential expression analysis in order to achieve a desired statistical power given a significance level α and an underlying distribution. Since RNA-seq data are read counts, a Poisson distribution is commonly used as the model for identifying DEGs in RNA-seq data [14,15]. Fang and Cui were the first to derive sample size calculations based on a Wald test with a Poisson distribution for a single gene in RNA-seq.

*Corresponding author: Rai SN, Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA, Tel: 1-502-852-4030; E-mail: Shesh.Rai@louisville.edu

Received December 21, 2016; Accepted January 25, 2017; Published January 30, 2017

Citation: Li X, Cooper NGF, Shyr Y, Wu D, Rouchka EC, et al. (2017) Inference and Sample Size Calculations Based on Statistical Tests in a Negative Binomial Distribution for Differential Gene Expression in RNA-seq Data. J Biom Biostat 8: 332. doi:10.4172/2155-6180.1000332

Copyright: © 2017 Li X, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Inference and Sample Size Calculations Based on Statistical Tests in a Negative Binomial Distribution for Differential Gene Expression in RNA-seq Data

Xiaohong Li 1,2, Nigel GF Cooper 2, Yu Shyr 3, Dongfeng Wu 1, Eric C Rouchka 4, Ryan S Gill 5, Timothy E O’Toole 6, Guy N Brock 1,7 and Shesh N Rai 1,*

1 Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
2 Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, 40202, USA
3 Department of Biostatistics, Vanderbilt University, Nashville, TN, 37232, USA
4 Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, 40292, USA
5 Department of Mathematics, University of Louisville, Louisville, KY, 40292, USA
6 Department of Cardiology, University of Louisville, Louisville, KY, 40202, USA
7 Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA

Abstract

High-throughput RNA sequencing (RNA-seq) technology has become a popular method of choice for transcriptomics and the detection of differentially expressed genes. Sample size calculations for RNA-seq experimental design are an important consideration in biological research and clinical trials. Sample size formulas derived from the Wald and likelihood ratio tests, with a Poisson distribution used to model RNA-seq data, have already been developed. However, since the mean read counts in real RNA-seq data are not equal to the variance, an extended method to calculate sample sizes based on a negative binomial distribution using an exact test statistic was proposed by Li et al. in 2013. In this study, we alternatively derive five sample size calculation methods based on the negative binomial distribution using the Wald test, the log-transformed Wald test and the log-likelihood ratio test statistics.
A comparison of our five methods and an existing method was performed by calculating the sample sizes and the simulated power in different scenarios. We first calculated the sample sizes for testing a single gene using the six methods given a nominal significance level α of 0.05 and 80% power. Then, we calculated the sample sizes for testing multiple genes given a false discovery rate (FDR) of 0.05 and 0.10. The empirical power and true prognostic genes for differential gene expression analysis corresponding to the estimated sample sizes from the six methods were also estimated via simulation studies. Using the sample size formulas derived from log-transformed and Wald-based tests, we observed smaller sample properties while maintaining the nominal power close to or higher than 80% in all the settings compared to other methods. Moreover, the Wald-test-based sample size calculation method is easier and faster to compute in an RNA-seq experimental design.
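
To make the workflow described in the abstract concrete (compute a per-group sample size for one gene, then check the empirical power by simulation), the following Python sketch may help. It is an illustration under stated assumptions, not the formulas derived in this paper. It uses a generic delta-method Wald test on the log fold change, assumes the dispersion phi is known, and parameterizes the negative binomial so that the variance is mu + phi * mu^2. The function names and the parameters mu0, rho and phi are hypothetical choices made for this example.

```python
import math
from statistics import NormalDist

import numpy as np


def wald_log_sample_size(mu0, rho, phi, alpha=0.05, power=0.80):
    """Per-group n for a two-sided Wald test on the log fold change.

    Assumption: generic delta-method approximation, with
    Var(log mean) ~ (1/mu + phi) / n in each group. Not the paper's formula.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    mu1 = rho * mu0  # mean count in the second group
    unit_var = 1 / mu0 + 1 / mu1 + 2 * phi
    return math.ceil((z_a + z_b) ** 2 * unit_var / math.log(rho) ** 2)


def nb_counts(rng, mu, phi, size):
    """Draw negative binomial counts with mean mu and variance mu + phi*mu^2."""
    r = 1 / phi        # NumPy's "n" parameter (may be non-integer)
    p = r / (r + mu)   # NumPy's success probability
    return rng.negative_binomial(r, p, size=size)


def empirical_power(n, mu0, rho, phi, alpha=0.05, n_sim=5000, seed=0):
    """Fraction of simulated two-group experiments in which the test rejects."""
    rng = np.random.default_rng(seed)
    m0 = nb_counts(rng, mu0, phi, (n_sim, n)).mean(axis=1)
    m1 = nb_counts(rng, rho * mu0, phi, (n_sim, n)).mean(axis=1)
    ok = (m0 > 0) & (m1 > 0)  # the log is undefined for an all-zero group
    z = np.zeros(n_sim)       # such runs are counted as non-rejections
    se = np.sqrt((1 / m0[ok] + 1 / m1[ok] + 2 * phi) / n)
    z[ok] = (np.log(m1[ok]) - np.log(m0[ok])) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return float(np.mean(np.abs(z) > z_crit))
```

For example, `wald_log_sample_size(mu0=10, rho=2, phi=0.1)` returns a per-group sample size that can be passed to `empirical_power` to see whether the simulated power lands near the 80% target. Calling `empirical_power` with `rho=1` gives a rough check of the type I error rate instead.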