The Generalized LASSO: a wrapper approach to gene selection for microarray data.

Volker Roth
University of Bonn, Computer Science III, Roemerstr. 164, D-53117 Bonn, Germany
August 16, 2002

Abstract

We report on the successful application of the Generalized LASSO method to feature selection problems for microarray data. This method implements a wrapper strategy for selecting relevant genes by optimizing the discriminative power of a logistic classification model. The selection process can be interpreted as a special instance of the Bayesian automatic relevance determination (ARD) principle. The most outstanding properties of the Generalized LASSO are: (i) excellent generalization ability; (ii) probabilistic outputs, rather than only binary class labels; (iii) generic definition of a doubt class collecting samples with uncertain predicted label; (iv) simultaneous assessment of prediction strength and stability of gene selection under resampling; (v) highly efficient optimization algorithm, capable of dealing with large-scale real-world applications. Experiments for several microarray datasets demonstrate both the outstanding classification performance and the biological relevance of the selected genes.

1 Introduction

Microarray experiments typically measure the expression levels of several thousands of genes simultaneously. In the context of cancer diagnostics, the discrimination between either diseased and healthy tissue, or between different kinds of diseased tissues, constitutes a major application for this technology. The aim is to improve the understanding of cancer pathogenesis on a molecular level. A central goal of the analysis of microarray data is the identification of small subsets of informative genes with cancer-specific expression profiles. The knowledge of highly informative marker genes is crucial for clinical applications, where the availability of inexpensive and fast diagnostic procedures is a major concern.

From a machine learning viewpoint, the above problem of identifying relevant genes is known as the problem of feature selection in supervised learning. The motivation for selecting a subset of features from which a learning rule is constructed may be twofold: from a signal-processing perspective, we focus on separating “signal” and