Graphical model-based gene clustering and metagene expression analysis

Adrian Dobra, Quanli Wang and Mike West
Duke University, Durham, NC 27708, USA

ABSTRACT

Summary: We describe a novel gene expression analysis method for the creation of overlapping gene clusters and associated metagene signatures that aim to characterize the dominant common expression patterns within each cluster. The analysis is based on the use of statistical graphical models to identify and estimate patterns of association among gene subsets from gene expression data, and clustering is then based on formal estimates of very sparse covariance matrices arising from these models. Metagene summaries, which are of interest as reduced dimensional summaries for phenotyping studies, are simply the resulting model-based estimates of dominant singular factors (principal components) of population variance matrices within the resulting overlapping clusters. We describe connections between graph-theoretic approaches to exploring gene expression graphical models and exploration in biological contexts of gene subsets represented by identified metagenes, illustrating some aspects of the utility of this framework for summary representation of observational gene expression data.

Availability: The software implementing our method is called MetageneCreator and is available for download at http://www.stat.duke.edu/~adobra/metagenecreator.htm

Supplemental information: http://www.stat.duke.edu/~adobra/ermeta.zip

Contact: adobra@stat.duke.edu

INTRODUCTION

A number of gene expression studies have demonstrated the utility of multivariate statistical methods for defining and estimating aggregate, common patterns underlying groups of genes.
Various clustering methods are in common use to define groups of genes, and data reduction methods that define weighted averages of expression of co-clustered genes can both reduce dimension and improve signal resolution when predicting a phenotype that is inherently related to multiple co-expressed or co-regulated genes. Singular value decomposition (principal component) methods are standard tools, and underlie methods for expression data summary, reduction and characterization as well as the use of such aggregate summaries as predictors of defined clinical or physiological phenotypes. Some key examples include the eigengenes of Alter et al. (2000, 2003), and the metagenes of West et al. (2001), Huang et al. (2003a,b) and Pittman et al. (2004). The latter authors focus on the use of clustering methods, as popularized by Eisen et al. (1998) for example, to define multiple clusters with a view to reducing dimension while hopefully maintaining a representation of multiple common aspects of variation in gene expression across samples, through weighted averages defined as the dominant singular factors within each cluster.

There are many possible variations on this kind of application of standard statistical clustering and dimension reduction. Our interest here lies in three aspects: first, the development of refined methods of clustering to ensure that the aggregate dominant singular factor does indeed represent a common pattern underlying a group of genes that show reasonable co-expression patterns; second, the enrichment of gene subsets defining clusters using estimated patterns of association between existing clusters and individual genes; and, third, improvement of the overall strategy utilizing improved estimates of covariance matrices of gene expression variables based on the use of Bayesian statistical graphical models.
We begin with discussion of a simple and effective heuristic algorithm that, beginning with k-means clustering, constructs subsets of genes whose variation can be summarized by the first singular factor (principal component) within the group. The idea is simply to iteratively refine larger clusters to focus on smaller subsets within which genes are more and more coherently co-expressed. This is followed by discussion of a method of enriching gene membership of clusters using a key but apparently novel measure of association (in covariance terms) between individual genes that are candidates to join a cluster and the existing group of genes within that cluster. This (and other) development of clustering and cluster enrichment of course relies on an estimate of the covariance matrix of expression of genes. The final contributions here focus on the use of sparse Bayesian graphical models for improving estimation of such high-dimensional covariance matrices. Here we discuss issues related to choosing model and parameter priors as well as distributed computational algorithms for model search. This is followed by details of how to derive model-based estimates of high-dimensional covariance matrices for use in clustering and other studies, and the broader use of such models in identifying candidate statistical association graphs: network representations of gene expression data that are of value in visualizing the empirical associations in new and sometimes insightful ways. The paper concludes with an example from breast cancer genomics and summary comments.
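
To make the idea of refining a k-means cluster until its first singular factor summarizes it more concrete, the following is a minimal sketch and not the algorithm developed later in the paper. It assumes a genes-by-samples expression matrix and uses the sample covariance implicitly through the SVD rather than a graphical model-based estimate. The function names (`first_factor`, `kmeans`, `refine_cluster`), the thresholds `min_frac` and `min_size`, and the rule of dropping the gene least correlated with the current metagene are all illustrative assumptions.

```python
import numpy as np

def first_factor(X):
    """X is genes x samples. Returns the first singular factor across
    samples (a candidate metagene) and the fraction of within-cluster
    variation it accounts for."""
    Xc = X - X.mean(axis=1, keepdims=True)       # center each gene
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = np.sum(s ** 2)
    frac = s[0] ** 2 / total if total > 0 else 1.0
    return s[0] * Vt[0], frac

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means on the rows (genes) of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def refine_cluster(X, idx, min_frac=0.7, min_size=3):
    """Assumed refinement rule (hypothetical): drop the gene least
    correlated with the current metagene until the first factor explains
    at least min_frac of the variation or min_size genes remain."""
    idx = np.asarray(idx)
    while len(idx) > min_size:
        metagene, frac = first_factor(X[idx])
        if frac >= min_frac:
            break
        Xc = X[idx] - X[idx].mean(axis=1, keepdims=True)
        norms = np.linalg.norm(Xc, axis=1) * np.linalg.norm(metagene)
        corr = np.abs(Xc @ metagene) / (norms + 1e-12)
        idx = np.delete(idx, corr.argmin())
    return idx

# Synthetic illustration: 10 genes sharing one pattern plus 5 noise genes.
rng = np.random.default_rng(1)
pattern = rng.normal(size=20)
X = np.vstack([pattern + 0.2 * rng.normal(size=(10, 20)),
               rng.normal(size=(5, 20))])
labels = kmeans(X, k=2)
for j in np.unique(labels):
    members = np.flatnonzero(labels == j)
    kept = refine_cluster(X, members)
    print(f"cluster {j}: {len(members)} genes -> {len(kept)} after refinement")
```

In this sketch the stopping rule is the share of variation carried by the first singular value; other coherence criteria could be substituted without changing the overall structure.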