Exploring Matrix Factorization Techniques for Classification of Gene Expression Profiles

R. Schachtner, D. Lutter, A. M. Tomé, E. W. Lang

Abstract

High-throughput genome-wide measurements of gene transcript levels have become available with the recent development of microarray technology. Intelligent and efficient mathematical and computational analysis tools are needed to read and interpret the information content buried in those large-scale gene expression patterns at various levels of resolution. Modern machine learning techniques based on matrix decomposition, such as Independent Component Analysis (ICA) and Nonnegative Matrix Factorization (NMF), provide new and efficient analysis tools that are currently being explored in this area. ICA decomposes such expression profiles into independent expression modes, while NMF groups genes together into metagenes. All of these extracted features are considered indicative of underlying regulatory processes. These exploratory methods can be applied to the classification of gene expression data sets, either to group samples into different categories for diagnostic purposes or to group genes into functional categories for further investigation of related metabolic pathways and regulatory networks.

In this study we focus on classification tasks and apply ICA and sparse NMF to various microarray data sets. The data sets cover gene expression levels of human breast cancer (HBC) cell lines, the well-known leukemia data set, and cell lines from a monocyte-macrophage (MoMa) differentiation study. The HBC data were taken from [1] and the leukemia data from [2]. We show that these tools are able to identify relevant signatures in the deduced matrices and extract marker genes from these gene expression profiles without the need for an extensive database search for appropriate functional annotations. With these marker genes, corresponding test data sets can easily be classified into related diagnostic categories.
These categories correspond to the ability of HBC cell lines to induce bone metastasis either strongly or weakly, to the AML and ALL leukemia types and the related ALL subtypes, and to either monocyte vs. macrophage cells or healthy vs. Niemann-Pick type C patients. The HBC data set consists of an in vivo extracted sub-population of 14 cell lines, which is used to train the classifier, and another 11 cell lines derived from single cells, which are used to test the performance of the classifier. Correspondingly, the leukemia data set consists of 38 cell lines as a training set and another 34 cell lines for testing. For the MoMa data set, human peripheral blood monocytes were isolated from healthy donors (exp. 1 and 2) and from donors with Niemann-Pick type C disease (exp. 3). The data set consisted of seven monocyte and seven macrophage expression profiles. Our results demonstrate that these methods are able to identify suitable marker genes which can be used to classify the type of tumor investigated.

References

[1] Y. Kang, P. M. Siegel, A. Shu, M. Drobnjak, S. M. Kakonen, C. Cordón, Th. A. Guise, and J. Massagué. A multigenic program mediating breast cancer metastasis to bone. Cancer Cell, 3:537–549, 2003.

[2] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 1999.
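
Illustrative sketch. The abstract states that NMF groups genes into metagenes from which marker genes can be read off. The following minimal Python sketch illustrates that idea under several assumptions that are not taken from the study: it uses plain NMF with multiplicative updates rather than the sparse NMF variant applied by the authors, it arranges the data as samples x genes, it selects the discriminating metagene by a hypothetical difference-of-class-means criterion, and it runs on invented toy data. All function names are made up for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def squared_error(x, w, h):
    """Squared Frobenius distance between x and the product w h."""
    wh = matmul(w, h)
    return sum((xv - rv) ** 2 for xr, rr in zip(x, wh) for xv, rv in zip(xr, rr))


def nmf(x, k, n_iter=500, seed=0, eps=1e-9):
    """Plain NMF via multiplicative updates: x (samples x genes) ~ w h.

    Rows of h play the role of metagenes; w holds their weight per sample.
    """
    rng = random.Random(seed)
    n_samples, n_genes = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_samples)]
    h = [[rng.random() + 0.1 for _ in range(n_genes)] for _ in range(k)]
    for _ in range(n_iter):
        # Update h with w fixed: h <- h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_genes)]
             for a in range(k)]
        # Update w with h fixed: w <- w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_samples)]
    return w, h


def marker_genes(w, h, labels, positive, n_top):
    """Return the n_top highest-loading genes of the metagene whose sample
    weights best separate `positive` samples from the rest.

    The separation score (scaled difference of class means) is a
    hypothetical choice made for this illustration only.
    """
    def mean(values):
        return sum(values) / len(values)

    def separation(a):
        col = [w[i][a] for i in range(len(w))]
        scale = max(col) or 1.0  # remove the arbitrary scaling of w columns
        pos = [v for v, lab in zip(col, labels) if lab == positive]
        neg = [v for v, lab in zip(col, labels) if lab != positive]
        return (mean(pos) - mean(neg)) / scale

    best = max(range(len(h)), key=separation)
    ranked = sorted(range(len(h[best])), key=lambda j: h[best][j], reverse=True)
    return sorted(ranked[:n_top])


# Invented toy data: 4 samples x 6 genes, two classes with disjoint active genes.
x = [
    [5.0, 4.0, 6.0, 0.0, 0.0, 0.0],
    [10.0, 8.0, 12.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 7.0, 2.0],
    [0.0, 0.0, 0.0, 4.5, 10.5, 3.0],
]
labels = ["A", "A", "B", "B"]

w, h = nmf(x, k=2)
markers = marker_genes(w, h, labels, positive="A", n_top=3)
print("marker gene indices for class A:", markers)
```

The pure-Python matrix helpers keep the sketch self-contained; for real microarray data with thousands of genes, a numerical library and a sparsity-constrained NMF would be the more appropriate choice.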