Reproducibility-Optimized Test Statistic for Ranking Genes in Microarray Studies

Laura L. Elo, Sanna Filén, Riitta Lahesmaa, and Tero Aittokallio

Abstract—A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. As previous studies on simulated or spike-in data sets do not provide practical guidance on how to choose the best method for a given real data set, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows consistently good performance under various simulated conditions and on an Affymetrix spike-in data set. Further, the feasibility of the novel statistic is confirmed in a practical research setting, using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given data set without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression but could be extended to a wide range of other applications as well.

Index Terms—Microarray, gene expression, gene ranking, reproducibility, differential expression, bootstrap.

1 INTRODUCTION

The identification of a set of candidate genes that are differentially expressed (DE) between distinct groups of samples is still a fundamental research problem in most microarray studies.
A wide range of algorithms for ranking genes in terms of their differential expression is currently available, ranging from simple statistics, such as fold change or signal log-ratio (slr), to penalized t-tests, such as the SAM procedure [22]. However, no general consensus exists on a standard protocol for microarray data analysis [2]. Although methodologists are increasingly developing new statistical tests for differential expression, together with sophisticated procedures for controlling or estimating multiple-testing error rates, either familywise or the false discovery rate (FDR), only limited effort has been devoted to providing microarray users with well-grounded guidance on how to choose suitable procedures for a given data set [2]. Although most computational algorithms seem to work reasonably well under some circumstances, in particular when sample sizes are relatively large, their performance can vary drastically depending on the properties of the data, such as the number of genes probed and their distributional characteristics, as well as the number of replicates available [3], [12], [16]. As confirmation studies on candidate genes are expensive and time consuming, techniques to assess in advance the behavior of different ranking approaches would be of great practical and economic significance.

Simulation studies have been a popular option for comparing different gene-ranking algorithms, for example, in terms of their receiver operating characteristics (ROC) [13]. Because the data properties are completely known a priori, simulation studies allow for unequivocal evaluation of the concordance between the underlying truth and the ranking results obtained [15]. Moreover, once a simulation model is established, one can generate multiple data sets and analyze the behavior of the algorithms as a function of a multitude of factors, such as sample and effect sizes or biological and technical variability.
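As a minimal illustration of the evaluation strategy described above (not the paper's own procedure), the following sketch simulates a two-group data set with a known set of DE genes, ranks the genes by two of the simple statistics mentioned, absolute fold change and the ordinary two-sample t-statistic, and scores each ranking by its ROC area against the known truth. All parameter values (gene counts, effect size, replicate numbers) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 1,000 genes, two groups of 5 replicates,
# with the first 100 genes differentially expressed (shift of 1.5).
n_genes, n_de, n_rep = 1000, 100, 5
effect = np.zeros(n_genes)
effect[:n_de] = 1.5
group1 = rng.normal(0.0, 1.0, size=(n_genes, n_rep)) + effect[:, None]
group2 = rng.normal(0.0, 1.0, size=(n_genes, n_rep))

def tstat(a, b):
    """Ordinary pooled-variance two-sample t-statistic, one per gene (row)."""
    na, nb = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    s2 = ((a.var(axis=1, ddof=1) * (na - 1) + b.var(axis=1, ddof=1) * (nb - 1))
          / (na + nb - 2))
    return diff / np.sqrt(s2 * (1.0 / na + 1.0 / nb))

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = (effect != 0).astype(int)
fc_scores = np.abs(group1.mean(axis=1) - group2.mean(axis=1))  # |fold change|
t_scores = np.abs(tstat(group1, group2))                        # |t|

print(f"AUC, fold change:  {roc_auc(fc_scores, labels):.3f}")
print(f"AUC, t-statistic:  {roc_auc(t_scores, labels):.3f}")
```

Because the DE genes are known by construction, the AUC values directly measure each statistic's concordance with the underlying truth, which is the kind of unequivocal evaluation that simulation studies permit and real data sets generally do not.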
It is unclear, however, how well simulations, even from a broad variety of distributions, can represent the actual microarray measurements, given the limited knowledge of the properties of individual mRNA levels and their complex relationships [15]. Alternatively, several authors have used real data sets whose true underlying characteristics are known. In particular, spike-in studies, where the genes and groups are characterized by known quantities of mRNA, have been applied extensively to evaluate computational steps involved in microarray data analysis. For instance, the AffyComp benchmark provides a means of comparing a number of available preprocessing methods for estimating Affymetrix GeneChip expression measurements [6], [10]. However, their relatively small spectrum and the lack of spike-in experiments for biological replicates or cDNA platforms may limit the generalizability of the results from the current spike-in studies.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 3, JULY-SEPTEMBER 2008

. L.L. Elo is with the Department of Mathematics, University of Turku, Turku, Finland, and the Turku Centre for Biotechnology, Turku, Finland. E-mail: laliel@utu.fi.
. S. Filén and R. Lahesmaa are with the Turku Centre for Biotechnology, Turku, Finland. E-mail: {sanna.filen, riitta.lahesmaa}@btk.fi.
. T. Aittokallio is with the Department of Mathematics, University of Turku, Turku, Finland, the Turku Centre for Biotechnology, Turku, Finland, and the Systems Biology Unit, Institut Pasteur, 25-28 Rue du Dr. Roux, 75724 Paris Cedex. E-mail: teanai@utu.fi, teanai@pasteur.fr.

Manuscript received 2 Oct. 2006; revised 21 Feb. 2007; accepted 16 Apr. 2007; published online 1 May 2007.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0182-1006.
Digital Object Identifier no.
10.1109/TCBB.2007.1078
1545-5963/08/$25.00 © 2008 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.