Bayesian Variable Selection for Linear Regression in High Dimensional Microarray Data

Wellington Cabrera
University of Houston
Houston, TX 77204, USA

Carlos Ordonez
University of Houston
Houston, TX 77204, USA

David Sergio Matusevich
University of Houston
Houston, TX 77204, USA

Veerabhadran Baladandayuthapani
MD Anderson Cancer Center, University of Texas
Houston, TX 77030, USA

ABSTRACT

Variable selection is a fundamental problem in Bayesian statistics whose solution requires exploring a combinatorial search space. We study the solution of variable selection with a well-known MCMC method, which requires thousands of iterations. We present several algorithmic optimizations that accelerate the MCMC method so that it works efficiently inside a database system. Our optimizations include sufficient statistics, variable preselection, hash tables and calls to a linear algebra library. We present experiments with very high dimensional microarray data sets to predict cancer survival time. We discuss encouraging findings, identifying specific genes likely to predict the survival time of brain cancer patients. We also show that our DBMS-based algorithm is orders of magnitude faster than the R statistical package. Our work shows that a DBMS is a promising platform for analyzing microarray data.

1. INTRODUCTION

DBMSs act as repositories of biomedical data sets, including gene expression data sets. In this work we address the problem of Bayesian variable selection for linear regression in microarray data sets using a DBMS. Benefits of data analysis inside a DBMS include fast data access, flexible querying and increased data security. Moreover, the data sets used in our project were already stored in a DBMS.

Variable selection, the search for the subsets of variables that best predict a target variable, is a hard problem when the number of dimensions is high (thousands). Since p variables yield 2^p candidate subsets, a brute force search is infeasible.
Instead, a promising approach from modern statistics is to solve this problem in a Bayesian framework using Markov Chain Monte Carlo (MCMC) methods [1, 5]. In our study, we apply an optimized algorithm to identify the variable subsets that best predict the survival time of brain cancer patients.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s). November 1, 2013, San Francisco, CA, USA.

2. SUMMARY

Linear regression is a statistical model that describes a linear relationship between a scalar dependent variable and a set of explanatory (or independent) variables. We focus on the case where the number of explanatory variables, p, is very high (thousands) and the number of data points, n, is small (a few hundred).

There are two main steps in our algorithm: variable preselection and an optimized Gibbs sampler. We integrate both steps into the DBMS through User Defined Functions (UDFs).

Variable preselection is an effective strategy for dealing with high dimensionality problems [2, 3]. To reduce the size of the original problem we process the initial data set, preselecting the d out of p variables that are most correlated with the dependent variable (n < d < p). The resulting reduced data set is loaded into RAM.

The second step of the algorithm is a Gibbs sampler with several optimizations. This is an MCMC method that produces an approximation of the posterior probabilities of the model parameters, based on an informative Zellner's g-prior [4]. We also introduce a prior on the vector of the k selected variables, favoring parsimonious models (1 < k ≪ n).
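
The preselection step described above can be illustrated with a short sketch. It is not the UDF implementation used in this work. It assumes the data fit in memory as plain Python lists, that "most correlated" means largest absolute Pearson correlation with the dependent variable, and that ties are broken by column index. The function names `pearson` and `preselect` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (assumption made
    here so that uninformative variables rank last).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def preselect(columns, y, d):
    """Return the indices of the d columns with the largest |correlation| to y.

    `columns` is a list of p variables, each a list of n observations.
    Ties are broken by the lower column index (an arbitrary choice).
    """
    scores = [(abs(pearson(col, y)), j) for j, col in enumerate(columns)]
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:d]]


# Toy usage: 4 candidate variables, 4 observations, keep d = 2.
columns = [[1, 2, 3, 4], [4, 3, 2, 1], [1, -1, 1, -1], [1, 1, 1, 1]]
y = [1, 2, 3, 4.5]
kept = preselect(columns, y, 2)
```

Ranking by absolute correlation keeps strongly negative predictors as well as strongly positive ones; only the d retained columns then need to be passed on to the sampler.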
We use the data summarizations described in [6] to accelerate the posterior calculations. Because many models are visited repeatedly, we store the computed probabilities in a hash table, which saves a considerable amount of time. Further acceleration is achieved by discarding low frequency variables after the burn-in period, that is, variables that appear fewer than a predetermined number of times in the selected models. We also integrate LAPACK into the DBMS to improve the accuracy and performance of matrix inversions.

Experimental results show consistent marginal probabilities for the most frequent variables across experiments. Our algorithm found some of the top markers (dimensions) that had previously been described in the biomedical literature as important in determining the survival time of cancer patients [7, 8]. Compared with the R statistical package, our algorithm is up to 100 times faster, depending on d. Each iteration has worst case time complexity T = O(dnk^2) + O(dk^3), but the average case is much better. We have successfully integrated this optimized algorithm with SQL queries and UDFs.
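
The hash table optimization mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code used in this work: a model is represented as a set of selected variable indices, the cache is a Python dictionary keyed by that set, and `toy_score` is a hypothetical placeholder standing in for the real (expensive) posterior computation. The names `make_cached_scorer` and `toy_score` are invented for this example.

```python
def make_cached_scorer(score_model):
    """Wrap an expensive model-scoring function with a hash table cache.

    `score_model` takes a frozenset of selected variable indices and
    returns a number. The wrapper returns the cached value whenever the
    same set of variables is scored again, regardless of index order.
    """
    cache = {}
    stats = {"hits": 0, "misses": 0}

    def cached(selected):
        key = frozenset(selected)  # order-independent, hashable key
        if key in cache:
            stats["hits"] += 1
        else:
            stats["misses"] += 1
            cache[key] = score_model(key)
        return cache[key]

    return cached, stats


def toy_score(model):
    """Hypothetical stand-in for the posterior computation.

    It only penalizes model size; a real scorer would evaluate the
    posterior probability of the model from the data summarizations.
    """
    return -float(len(model))


# Toy usage: the second call revisits the model {1, 3} and hits the cache.
scorer, stats = make_cached_scorer(toy_score)
first = scorer([3, 1])
second = scorer([1, 3])
third = scorer([2])
```

Using the set of indices as the key means that a model is recognized as already visited however the sampler happened to reach it, so the costly computation runs at most once per distinct model.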