Biometrics 59, 992–1000
December 2003

Penalized Discriminant Methods for the Classification of Tumors from Gene Expression Data

Debashis Ghosh
Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48105, U.S.A.
email: ghoshd@umich.edu

Summary. Due to the advent of high-throughput microarray technology, it has become possible to develop molecular classification systems for various types of cancer. In this article, we propose a methodology using regularized regression models for the classification of tumors in microarray experiments. The performances of principal components, partial least squares, and ridge regression models are studied; these regression procedures are adapted to the classification setting using the optimal scoring algorithm. We also develop a procedure for ranking genes based on the fitted regression models. The proposed methodologies are applied to two microarray studies in cancer.

Key words: Cross-validation; Microarrays; Partial least squares; Principal components; Regularization; Ridge regression.

1. Introduction

With the development of large-scale, high-throughput gene expression technology, it has become possible to diagnose and classify disease, particularly cancer, based on these assays (Alizadeh et al., 2001). This has been termed “class prediction” in the microarray literature (Golub et al., 1999).

An example of a microarray experiment in cancer is given by Khan et al. (2001). The goal of this study was to develop a method of classifying childhood cancers into diagnostic groupings using their gene expression profiles. For the experiment, 63 training samples, representing various types of small, round blue cell tumors (SRBCTs), were collected; the gene expression profiles were measured using cDNA microarrays. The authors then used artificial neural networks (ANN) to train a model for classifying cancers based on the gene expression profiles.
They then applied their ANN to a collection of 25 test samples and found that the neural network model correctly classified all 25 of the test cases (although five cases represented non-SRBCTs).

In addition to this example, there have been several investigations utilizing supervised learning methods for the classification of tumors based on microarray data. Golub et al. (1999) utilized a nearest-neighbor classifier method for the classification of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) in children. Dudoit, Fridlyand, and Speed (2002) performed a systematic comparison of several discrimination methods for classification of tumors based on microarray experiments. While they found linear discriminant analysis to perform the best, the method could be used only after the number of genes had been drastically reduced, from thousands to tens, using a univariate filtering criterion. A more recent technique that is popular in computer science, namely, support vector machines, has also been applied to the classification of tumors using microarray data (Yeang et al., 2001). There has also been some work on utilizing latent-factor models for classification (Li and Hong, 2001; West et al., 2001).

One feature of microarray studies is that the number of tumor samples collected tends to be much smaller than the number of genes per chip. The former number tends to be on the order of tens or hundreds, while microarrays typically contain thousands of genes on each chip. In statistical terms, the number of predictor variables is much larger than the number of independent samples. If the scientific question is whether gene expression profiles can predict tumor type, then from a regression point of view, it makes sense to treat the gene expression profile as the covariates. For these types of problems, some type of regularization or variable reduction is clearly needed.
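The need for regularization when predictors outnumber samples can be illustrated with a minimal sketch. It is not the procedure developed in this article: the data are simulated, the sizes n = 20 and p = 500 and the penalty value 1.0 are arbitrary assumptions, and the function name ridge_fit is hypothetical. With p > n the matrix X^T X has rank at most n and cannot be inverted, so ordinary least squares has no unique solution, whereas adding a ridge penalty gives a unique coefficient vector.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients computed through an n x n system (cheap when p >> n).

    Uses the identity (X^T X + lam I_p)^{-1} X^T y = X^T (X X^T + lam I_n)^{-1} y,
    which holds for any lam > 0.
    """
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha

rng = np.random.default_rng(0)
n, p = 20, 500                      # assumed sizes: few samples, many genes
X = rng.standard_normal((n, p))     # simulated expression matrix (not real data)
X -= X.mean(axis=0)                 # center each column (gene)
y = rng.standard_normal(n)          # simulated continuous response
y -= y.mean()

# rank(X^T X) equals rank(X), which is at most n - 1 after centering,
# so the p x p matrix X^T X is singular and least squares is not unique.
rank = np.linalg.matrix_rank(X)

beta = ridge_fit(X, y, lam=1.0)     # unique solution once the penalty is added
```

Solving the n x n system rather than the p x p one is a design choice for convenience in this setting; both forms give the same coefficients.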
In most of the previous work described, the authors have used univariate methods for reducing the number of genes under consideration before applying the classification methods. An alternative approach was taken in Khan et al. (2001), where the authors applied principal components analysis to the gene expression data before training the ANN models.

Another field in which the “large p, small n” (West, 2003, to appear) problem exists is chemometrics. Frank and Friedman (1993) proposed using regularized regression models for analyzing chemometric data. However, in these settings, the response of interest is continuous, while for classification problems, the label is a categorical variable.

In this article, we present a methodology that extends the regularized regression models of chemometrics to classification