978-1-61284-852-5/11/$26.00 ©2011 IEEE

A New Gene Subset Selection Approach Based on Linearly Separating Gene Pairs

Amirali Jafarian
School of Computer Science, University of Windsor, Windsor, Ontario, Canada N9B 3P4
jafaria@uwindsor.ca

Alioune Ngom
School of Computer Science, University of Windsor, Windsor, Ontario, Canada N9B 3P4
angom@cs.uwindsor.ca

Abstract—The concept of linear separability of gene expression data sets with respect to two classes has recently been studied in the literature. The problem is to efficiently find all pairs of genes which induce a linear separation of the data. It has been suggested that an underlying molecular mechanism relates the two genes of a separating pair to the phenotype under study, such as a specific cancer. In this paper we study the Containment Angle (CA), defined on the unit circle for a linearly separable gene pair, as a better alternative to the paired t-test ranking function for gene selection. Using the CA, we also show empirically that a given classifier's error is related to the degree of linear separability of a given data set. Finally, we propose a new gene subset selection approach based on the CA ranking function. On many data sets, our approach gives better results, in terms of subset size and classification accuracy, than well-performing existing methods.

Keywords-linear separability; gene expression; classification; Containment Angle; gene ranking; subset selection

I. INTRODUCTION

DNA microarrays give the expression levels of thousands of genes in parallel for a single tissue sample, condition, or time point. Microarray data sets are usually noisy and have a low sample size relative to the large number of measured genes. Such data sets present many difficult challenges for sample classification algorithms: too many genes are noisy, irrelevant, or redundant for the learning problem at hand.
Our present work introduces gene subset selection methods based on the concept of linear separability of gene expression data sets, as introduced recently in [1]. We use the geometric notion of linear separation by pairs of genes from [1] (where samples belong to one of two distinct classes, termed red and blue samples) to define a simple criterion for selecting the best subsets of genes for the purpose of sample classification.

Gene subset selection methods have received considerable attention in recent years as a better means of dimensionality reduction than feature extraction methods, which yield features that are difficult to interpret. The gene subset selection problem is to determine the smallest subset of genes whose expression values allow sample classification with the highest possible accuracy. Many approaches have been proposed in the literature to solve this problem. A simple and common method is the filter approach, which first ranks single genes according to how well each of them separates the classes (we assume two classes in this paper), and then selects the top r ranked genes as the gene subset to be used, where r is the smallest integer for which the resulting subset yields the best classification accuracy. Many gene ranking criteria have been proposed, based on different principles (or combinations of principles), including redundancy, relevancy, and others [2], [6]. Filter methods are simple and fast, but they do not necessarily produce the best gene subsets, since there are gene subsets that allow better separation than the best subsets of top-ranked genes. Other methods introduced in the literature are the wrapper approaches, which evaluate subsets of genes irrespective of any possible ranking over the genes. Such methods are based on heuristics which directly search the space of gene subsets and are guided by a classifier's performance on the selected gene subsets [9]. The best methods combine both gene ranking and wrapper approaches, but they are computationally intensive.
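The single-gene filter approach described above can be sketched as follows. This is only an illustrative sketch and not the method proposed in this paper: it assumes the absolute Welch t-statistic as the ranking score, a nearest-centroid classifier evaluated by leave-one-out cross-validation, and class labels coded as 0 and 1. The function names (`t_scores`, `loo_accuracy`, `filter_select`) and the `max_r` parameter are hypothetical.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic of each gene (column of X) between classes 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed classifier)."""
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i           # hold out sample i
        X_train, y_train = X[keep], y[keep]
        c0 = X_train[y_train == 0].mean(axis=0)  # centroid of class 0
        c1 = X_train[y_train == 1].mean(axis=0)  # centroid of class 1
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += int(pred == y[i])
    return hits / len(y)

def filter_select(X, y, max_r=10):
    """Rank single genes, then return the top-r genes for the smallest r
    that attains the best accuracy among r = 1..max_r."""
    order = np.argsort(-t_scores(X, y))          # best-scoring gene first
    best_r, best_acc = 0, -1.0
    for r in range(1, min(max_r, X.shape[1]) + 1):
        acc = loo_accuracy(X[:, order[:r]], y)
        if acc > best_acc:                       # strict '>' keeps the smallest r on ties
            best_r, best_acc = r, acc
    return order[:best_r], best_acc
```

Here `X` is a samples-by-genes matrix and `y` a vector of 0/1 class labels. Note that this sketch ranks the genes on all samples before cross-validating, which tends to give optimistic accuracy estimates; a more careful evaluation would repeat the ranking inside each fold.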
Our approach in this paper is to use and evaluate pairs of genes, rather than single genes, for the purpose of finding the best gene subsets. We propose a simple but new ranking criterion for gene pairs in order to evaluate how well each pair separates the classes. Additionally, in order to find the best gene subsets, we devise a filter method based on selecting only linearly separating gene pairs. A similar method, in which gene pairs are used for the purpose of finding the best gene subsets, was first introduced in [2]. Given a gene pair, the authors of [2] use the diagonal linear discriminant (DLD) to compute the projected coordinate of each sample on the DLD axis using only the two genes, and then take the two-sample t-statistic of these projected samples as the pair's score. They then devise two filter methods for gene subset selection based on the pair t-scores. Our method is different in that we: 1) use a ranking criterion based on the geometric notion of linear separation by gene pairs as introduced in [1], and 2) devise a filter method for gene subset selection which is based on our pair scores.

II. LINEAR SEPARABILITY OF EXPRESSION DATA SETS

Recently, the authors of [1] proposed a geometric notion of linear separation by gene pairs, in the context of gene expression data sets, where samples belong to one of two distinct classes, termed red and blue classes. The authors then introduced a novel, highly efficient algorithm for finding all gene pairs that induce a linear separation of the two-class samples. Let $m = m_1 + m_2$ be the number of samples, out of