Biometrics 59, 992–1000
December 2003

Penalized Discriminant Methods for the Classification of Tumors from Gene Expression Data

Debashis Ghosh
Department of Biostatistics, University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48105, U.S.A.
email: ghoshd@umich.edu

Summary. Due to the advent of high-throughput microarray technology, it has become possible to develop molecular classification systems for various types of cancer. In this article, we propose a methodology using regularized regression models for the classification of tumors in microarray experiments. The performances of principal components, partial least squares, and ridge regression models are studied; these regression procedures are adapted to the classification setting using the optimal scoring algorithm. We also develop a procedure for ranking genes based on the fitted regression models. The proposed methodologies are applied to two microarray studies in cancer.

Key words: Cross-validation; Microarrays; Partial least squares; Principal components; Regularization; Ridge regression.

1. Introduction

With the development of large-scale, high-throughput gene expression technology, it has become possible to diagnose and classify disease, particularly cancer, based on these assays (Alizadeh et al., 2001). This has been termed "class prediction" in the microarray literature (Golub et al., 1999).

An example of a microarray experiment in cancer is given by Khan et al. (2001). The goal of this study was to develop a method of classifying childhood cancers into certain diagnostic groupings using their gene expression profiles. For the experiment, 63 training samples, representing various types of small, round blue cell tumors (SRBCTs), were collected; the gene expression profiles were measured using cDNA microarrays. The authors then used artificial neural networks (ANN) to train a model for classifying the cancers based on the gene expression profiles.
The authors then applied their ANN to a collection of 25 test samples and found that the neural network model correctly classified all 25 of the test cases (although five cases represented non-SRBCTs).

In addition to this example, there have been several investigations utilizing supervised learning methods for the classification of tumors based on microarray data. Golub et al. (1999) utilized a nearest-neighbor classifier method for the classification of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) in children. Dudoit, Fridlyand, and Speed (2002) performed a systematic comparison of several discrimination methods for classification of tumors based on microarray experiments. While they found linear discriminant analysis to perform the best, the method could be used only after the number of genes had been drastically reduced, from thousands to tens, using a univariate filtering criterion. A more recent technique that is popular in computer science, namely, support vector machines, has also been applied to the classification of tumors using microarray data (Yeang et al., 2001). There has also been some work on utilizing latent-factor models for classification (Li and Hong, 2001; West et al., 2001).

One feature of microarray studies is that the number of tumor samples collected tends to be much smaller than the number of genes per chip. The former number tends to be on the order of tens or hundreds, while microarrays typically contain thousands of genes on each chip. In statistical terms, the number of predictor variables is much larger than the number of independent samples. If the scientific question is whether gene expression profiles can predict tumor type, then from a regression point of view, it makes sense to think of the gene expression profile as the covariates. For these types of problems, it should be obvious that some type of regularization or variable reduction is needed.
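The need for regularization when predictors outnumber samples can be made concrete with a small numerical sketch. The example below is illustrative only and is not the procedure developed in this article: the data (two samples, three "genes"), the helper names `predict`, `solve`, and `ridge`, and the penalty value are all hypothetical. It shows that two different coefficient vectors reproduce the training responses exactly, so unpenalized least squares has no unique solution, whereas a ridge penalty selects a single solution. The ridge fit is computed through the standard identity (X'X + λI)⁻¹X'y = X'(XX' + λI)⁻¹y, which requires solving only an n × n system.

```python
# Toy data (hypothetical): n = 2 samples, p = 3 predictors, so p > n.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, -1.0]


def predict(X, beta):
    """Return the fitted values X @ beta for a list-of-lists matrix X."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def solve(A, b):
    """Solve the square system A z = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ridge(X, y, lam):
    """Ridge coefficients via the dual form X' (X X' + lam I)^{-1} y."""
    n, p = len(X), len(X[0])
    K = [[sum(a * b for a, b in zip(X[i], X[j])) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y)
    return [sum(X[i][j] * alpha[i] for i in range(n)) for j in range(p)]


# Two distinct coefficient vectors that both fit the training data exactly.
beta_a = [1.0, -1.0, 0.0]
beta_b = [0.0, -2.0, 1.0]
fits_a = predict(X, beta_a)
fits_b = predict(X, beta_b)

# With a penalty (lam = 1.0 is an arbitrary choice) the solution is unique.
beta_ridge = ridge(X, y, lam=1.0)
```

Because `beta_a` and `beta_b` differ by a vector that X maps to zero, the training data alone cannot distinguish between them; the penalty resolves the ambiguity by shrinking the coefficients toward zero.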
In most of the previous work described, the authors have used univariate methods for reducing the number of genes under consideration before applying the classification methods. An alternative approach was taken in Khan et al. (2001), where the authors applied principal components analysis to the gene expression data before training the ANN models.

Another field in which the "large p, small n" (West, 2003, to appear) problem exists is chemometrics. Frank and Friedman (1993) proposed using regularized regression models for analyzing chemometric data. However, in these settings, the response of interest is continuous, while for classification problems, the label is a categorical variable.

In this article, we present a methodology that extends the regularized regression models of chemometrics to classification