BIOINFORMATICS Vol. 18 Suppl. 1 2002, Pages S120–S127

Linking gene expression data with patient survival times using partial least squares

Peter J. Park 1, Lu Tian 2 and Isaac S. Kohane 1

1 Children's Hospital Informatics Program and Harvard Medical School, 300 Longwood Ave, Boston, MA, 02115, USA and 2 Department of Biostatistics, Harvard School of Public Health, 655 Huntington Ave, Boston, MA, 02115, USA

Received on January 24, 2002; revised and accepted on April 1, 2002

ABSTRACT
There is an increasing need to link the large amount of genotypic data, gathered using microarrays for example, with various phenotypic data from patients. The classification problem in which gene expression data serve as predictors and a class label phenotype as the binary outcome variable has been examined extensively, but there has been less emphasis on dealing with other types of phenotypic data. In particular, patient survival times with censoring are often not used directly as a response variable due to the complications that arise from censoring. We show that the issues involving censored data can be circumvented by reformulating the problem as a standard Poisson regression problem. The procedure for solving the transformed problem is a combination of two approaches: partial least squares, a regression technique that is especially effective when there is severe collinearity due to a large number of predictors, and generalized linear regression, which extends standard linear regression to deal with various types of response variables. The linear combinations of the original variables identified by the method are highly correlated with the patient survival times and at the same time account for the variability in the covariates. The algorithm is fast, as it does not involve any matrix decompositions in the iterations. We apply our method to data sets from lung carcinoma and diffuse large B-cell lymphoma studies to verify its effectiveness.
Contact: peter park@harvard.edu

Keywords: microarrays; generalized linear models; survival analysis; Poisson regression; principal components analysis.

INTRODUCTION
Simultaneous measurement of mRNA transcripts for thousands of genes using microarrays has made it possible to study gene expression on a genome-wide scale (for an overview, see Collins (1999) and the articles that follow). The two most common types are oligonucleotide and cDNA arrays, but other platforms such as SAGE (Serial Analysis of Gene Expression) are also available. Expression profiling has been used in several contexts, notably in functional characterization of genes and classification of disease types.

One of the great challenges in medicine is to correlate genotypic data, such as gene expression measurements and presence of single nucleotide polymorphisms, and other covariates, such as age and gender, to a variety of phenotypic data from the patient. Capturing the relationship between the phenotype and the genotype would not only allow for a predictive model that can aid in diagnosis and treatment, but also bring about a better understanding of the basic biological processes.

The phenotypes considered in many studies so far have been limited to relatively simple cases. The most common is the binary type, typically comparing one disease against normal or another disease (Golub et al., 1999; Alon et al., 1999; Alizadeh et al., 2000). Larger data sets containing several types of a disease have also become common, and multiclass classification has started to receive more attention recently (Ramaswamy et al., 2001; Bhattacharjee et al., 2001). In general, however, phenotypic data can take several forms. It may be, for example, 'count' data, such as the number of recurrences of a disease, or continuous data, such as blood pressure.
One particularly important case is that of patient survival time, such as the time from the beginning of a treatment to a 'failure', usually an occurrence of a particular condition or death. The difficulty in dealing with survival data is that failure times may not always be observed. That is, for some patients, failure is known to occur after a certain time, but the exact time is not known ('right-censoring'). This happens, for example, when a clinical trial is terminated before all the patients have failed, or when a patient leaves the study early. Unfortunately, many of the current algorithms for linking gene expression data with phenotypic data cannot be easily extended to the more general cases.

© Oxford University Press 2002

A major source of difficulty in dealing with microarray data is that the number of variables (genes) is much