Stability-Based Dimension Estimation of ICA with Application to Microarray Data Analysis

Chen Wang 1, Jianhua Xuan 1*, Ting Gong 1, Robert Clarke 2, Eric Hoffman 3, Yue Wang 1

1 Dept. of Electrical and Computer Engineering, Virginia Tech, Arlington, VA 22203, USA
2 Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA
3 Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC 20010, USA

Abstract: Independent component analysis (ICA) is a statistical decomposition method that has been applied to microarray data analysis and gene regulatory network modeling. Although several papers have reported the effectiveness of ICA for gene expression data analysis, few have focused on the dimension estimation problem in ICA, i.e., how to estimate the number of independent components. Leaving the component number undecided not only leads to dimension ambiguity in ICA, but can also produce false components or miss underlying components, making the ICA results difficult to interpret biologically. In this paper, we propose a stability-based dimension estimation scheme for ICA that requires no prior information about the number of underlying components. We first demonstrate the feasibility of the proposed scheme on simulation data, showing that it is more accurate than its Bayesian counterpart. We then apply our dimension estimation scheme to a real yeast cell cycle gene expression data set, showing that not only can gene modules enriched in biological function be discovered, but the independent components are also consistent with the transcription factors estimated by Network Component Analysis (NCA).

1. Introduction

As high-throughput technology has been widely adopted in biomedical research, a huge amount of microarray data is now available to the bioinformatics research community [1, 2].
However, properly analyzing microarray data to reveal the underlying biological mechanisms or models under study remains a challenging task in bioinformatics research [3]. Gene regulatory network (GRN) modeling is one of the important topics in this area, and many approaches to reverse engineering GRNs and discovering gene modules have been proposed [4, 5]. Among these approaches, the linear model is often used because of its simplicity and reasonable assumptions. ICA is a statistically principled linear decomposition method that models the observations as a linear combination of independent latent variables [6]. ICA has shown its effectiveness for module discovery in many biological studies, from simple yeast model systems [7, 8, 9, 10], to complicated disease profiling [11, 12], to metabolite fingerprinting [13]. However, an ambiguity problem remains in these approaches: the number of underlying components (or latent variables) is unresolved. Without proper selection of the component number, ICA may miss important latent components or incorrectly split a component into two or more “components”. As a result, different component numbers yield different decomposition results, even under the same criterion or model. The existing approaches avoid this ambiguity problem by making some unrealistic simplifications. For example, Lee et al. assumed that the component number is equal to the sample number in their ICA analyses [9]. Frigyesi et al. chose the component number by retaining 90% of the energy of the eigenvalues [12]. Scholz et al. chose the number of components for ICA by maximizing a kurtosis measure [13]. Although these techniques offer simple heuristic solutions to the dimension estimation problem, none of them provides a persuasive justification.
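To make the eigenvalue-energy heuristic concrete, the following is a minimal sketch of one way such a rule could be computed; it is not the implementation used in [12]. The function name `n_components_by_energy`, the data layout (rows are samples, columns are genes), the per-row centering, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def n_components_by_energy(X, energy=0.90):
    """Smallest k such that the k largest covariance eigenvalues
    retain at least `energy` of the total eigenvalue sum.

    Assumed layout: X has shape (n_samples, n_genes).
    """
    # Center each sample profile across genes (an illustrative choice).
    Xc = X - X.mean(axis=1, keepdims=True)
    # Squared singular values are proportional to the eigenvalues of the
    # sample covariance matrix and are returned in descending order.
    s = np.linalg.svd(Xc, compute_uv=False)
    eig = s ** 2
    frac = np.cumsum(eig) / eig.sum()
    # First index at which the cumulative fraction reaches the threshold.
    k = int(np.searchsorted(frac, energy)) + 1
    return min(k, len(eig))


# Hypothetical example: 3 super-Gaussian sources mixed into 10 samples.
rng = np.random.default_rng(0)
S = rng.laplace(size=(3, 2000))      # latent components (assumed)
A = rng.normal(size=(10, 3))         # mixing matrix (assumed)
X = A @ S + 0.05 * rng.normal(size=(10, 2000))
print(n_components_by_energy(X, energy=0.90))
```

The returned number depends directly on the chosen threshold (0.90 here), which reflects the concern raised above: the rule is simple to apply, but the cutoff itself has no principled justification.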
Note that other dimension estimation methods exist, such as Bayesian model selection [14, 15], but their assumptions are usually not suitable for microarray data analysis because no prior information about the components is available.

In this paper, we propose a novel dimension estimation method based on stability analysis, which does not require any prior information about the underlying components. Specifically, we develop two stability analysis schemes, “splitting by samples” and “splitting by genes”, for a similarity analysis of either the recovered components or the mixing matrices. The dimension estimation method has been applied to simulation data and a yeast cell cycle microarray data set. The results demonstrate that the proposed method estimates the dimension accurately and yields gene modules enriched in biological function.

2. ICA and Its Application to Microarray Data Analysis

Independent component analysis (ICA) is a statistical technique for revealing hidden factors that underlie sets of measurements. ICA originated from the problem of blind source separation, in which only observations are

Presented in BIOCOMP'07 - The 2007 International Conference on Bioinformatics & Computational Biology