International Statistical Institute, 55th Session 2005

PLS generalizations for dimensionality reduction in supervised classification

Jose Vega-Vilca
University of Puerto Rico at Mayaguez, Department of Mathematics
Mayaguez Campus
Mayaguez, Puerto Rico 00680
jose_vv@math.uprm.edu

Edgar Acuña
University of Puerto Rico at Mayaguez, Department of Mathematics
Mayaguez Campus
Mayaguez, Puerto Rico 00680
edgar@math.uprm.edu

1. Introduction

The development of technologies such as microarrays has generated large amounts of data. The main characteristic of this kind of data is a large number of predictors (genes) and few observations (experiments). Thus, the data matrix X is of order n×p, where n is much smaller than p. Before using any multivariate statistical technique, such as regression or classification, to analyze the information contained in these data, we need to apply either feature selection methods or dimensionality reduction based on orthogonal variables in order to eliminate multicollinearity among the predictor variables. Doing so avoids severe prediction errors as well as decreasing the computational burden required to build and validate the classifier. Principal component analysis (PCA) is a technique that has been used for some time to reduce dimensionality. However, the first components, which capture most of the variability in the data structure, do not necessarily improve prediction when used for either regression or classification (Yeung and Ruzzo, 2001). Partial least squares (PLS), introduced by Wold in 1975, is another technique for reducing dimensionality in a regression context using orthogonal components. The assurance that the first PLS components improve prediction has made PLS a widely used technique, particularly in chemistry, where the field is known as chemometrics.
Nguyen and Rocke (2002a,b,c), working on supervised classification methods for microarray data, reduced dimensionality by first applying feature selection using statistical techniques, such as difference of means and analysis of variance, and then applying PLS regression, treating the vector of classes (a categorical variable) as a continuous response vector. This procedure is not adequate, since the predictions are not necessarily integers and must be rounded, losing accuracy. In spite of these shortcomings, PLS regression has given reasonable results. In this paper, we implement generalizations of PLS regression as a dimensionality reduction technique for supervised classification. We extend a technique introduced by Bastien et al. (2002), who combined PLS with logistic regression for two-class problems and with ordinal logistic regression for multiclass problems. In this paper, we combine PLS with nominal logistic regression, since it is very uncommon to have ordered classes. We also consider multivariate PLS along with logistic regression, and the construction of PLS components from linear discriminant analysis and projection pursuit. Our proposals improve on two recent results, by Fort and Lambert (2003) and Ding and Gentleman (2004), in which logistic regression and PLS are also combined, but whose methodology is suitable only for datasets with two classes.