978-1-61284-852-5/11/$26.00 ©2011 IEEE 105
A New Gene Subset Selection Approach Based on Linearly Separating Gene Pairs
Amirali Jafarian
School of Computer Science
University of Windsor
Windsor, Ontario, Canada N9B 3P4
jafaria@uwindsor.ca
Alioune Ngom
School of Computer Science
University of Windsor
Windsor, Ontario, Canada N9B 3P4
angom@cs.uwindsor.ca
Abstract—The concept of linear separability of gene expression
data sets with respect to two classes has recently been studied
in the literature. The problem is to efficiently find all pairs of genes
which induce a linear separation of the data. It has been
suggested that an underlying molecular mechanism relates
together the two genes of a separating pair to the phenotype
under study, such as a specific cancer. In this paper we study
the Containment Angle (CA) defined on the unit circle for a
linearly separable gene pair as a better alternative to the
paired t-test ranking function for gene selection. Using the CA,
we also show empirically that a given classifier’s error is
related to the degree of linear separability of a given data set.
Finally, we propose a new gene subset selection approach based
on the CA ranking function. On many data sets, our approach gives
better results, in terms of subset size and classification accuracy,
than well-performing methods.
Keywords-linear separability; gene expression;
classification; Containment Angle; gene ranking; subset
selection
I. INTRODUCTION
DNA microarrays measure the expression levels of
thousands of genes in parallel for a single tissue
sample, condition, or time point. Microarray data sets are
usually noisy and have a small sample size relative to the large number
of measured genes. Such data sets present many difficult
challenges for sample classification algorithms: too many
genes are noisy, irrelevant or redundant for the learning
problem at hand. Our present work introduces gene subset
selection methods based on the concept of linear
separability of gene expression data sets as introduced
recently in [1]. We use their geometric notion of linear
separation by pairs of genes (where samples belong to one
of two distinct classes termed red and blue samples in [1])
to define a simple criterion for selecting (best subsets of)
genes for the purpose of sample classification. Gene subset
selection methods have received considerable attention in
recent years as a better dimensionality reduction method
than feature extraction methods, which yield features that are
difficult to interpret. The gene subset selection problem is to
determine the smallest subset of genes whose expression
values allow sample classification with the highest possible
accuracy. Many approaches have been proposed in the
literature to solve this problem. A simple and common
method is the filter approach, which first ranks single genes
according to how well each of them separates the classes (we
assume two classes in this paper), and then selects the top r
ranked genes as the gene subset to be used, where r is the
smallest integer that yields the best classification
accuracy when using the subset. Many gene ranking criteria
have been proposed based on different principles (or a combination
of them), including redundancy, relevancy, or others [2],
[6]. Filter methods are simple and fast, but they do not
necessarily produce the best gene subsets, since there are
gene subsets allowing better separation than the best subsets
of top-ranked genes. Other methods introduced in the literature
are the wrapper approaches, which evaluate subsets of
genes irrespective of any possible ranking over the genes.
Such methods are based on heuristics that directly search
the space of gene subsets and are guided by a classifier’s
performance on the selected gene subsets [9]. The best
methods combine both gene ranking and wrapper
approaches but are computationally intensive.
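As an illustration of the filter approach described above, the following Python sketch ranks single genes by a two-sample t-statistic and then picks the smallest r whose top-r subset gives the best accuracy. The t-statistic is only one possible ranking criterion, and the names `t_score`, `rank_genes`, `select_top_r`, and the caller-supplied `evaluate` function (mapping a gene subset to a classification accuracy) are hypothetical and not taken from this paper.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Absolute Welch two-sample t-statistic for one gene.

    a, b: expression values of the gene in class 0 and class 1.
    Each class needs at least two samples (statistics.variance requires it).
    """
    diff = abs(mean(a) - mean(b))
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Constant expression within both classes: perfect or no separation.
        return float("inf") if diff > 0 else 0.0
    return diff / se


def rank_genes(X, y):
    """Return gene indices sorted from best to worst score.

    X: list of samples, each a list of expression values (one per gene).
    y: list of class labels, 0 or 1, one per sample.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append(t_score(a, b))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


def select_top_r(X, y, evaluate, max_r=None):
    """Pick the smallest r whose top-r ranked genes maximize evaluate(subset).

    evaluate: hypothetical callback, e.g. cross-validated accuracy of some
    classifier restricted to the given gene indices.
    """
    ranking = rank_genes(X, y)
    max_r = max_r or len(ranking)
    best_r, best_acc = 1, -1.0
    for r in range(1, max_r + 1):
        acc = evaluate(ranking[:r])
        if acc > best_acc:  # strict '>' keeps the smallest r on ties
            best_r, best_acc = r, acc
    return ranking[:best_r], best_acc
```

In practice `evaluate` would wrap a classifier and a validation scheme; the sketch leaves that choice to the caller because the filter step itself does not depend on it.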
Our approach in this paper is to use and evaluate pairs
of genes, rather than single genes, for the purpose of finding
the best gene subsets. We propose a simple but new ranking
criterion for gene pairs in order to evaluate how well each
pair separates the classes. Additionally, in order to find the
best gene subsets, we devise a filter method based on
selecting only linearly separating gene pairs. A similar
method in which gene pairs are used for the purpose of
finding best gene subsets was first introduced in [2]. Given
a gene pair, the authors used a diagonal linear discriminant
(DLD) and computed the projected coordinate of each sample
on the DLD axis using only the two genes, and then
took the two-sample t-statistic on these projected samples as
the pair’s score. The authors then devised two filter methods
for gene subset selection based on the pair t-scores. Our
method is different in that we: 1) use a ranking criterion
based on the geometric notion of linear separation by gene
pairs as introduced in [1], and 2) devise a filter method for
gene subset selection which is based on our pair scores.
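The pair t-score of [2] summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it uses the common form of the DLD axis, in which each gene's weight is the difference of class means divided by that gene's pooled within-class variance, followed by a pooled-variance two-sample t-statistic on the projections. The exact formulation in [2] may differ, and the function name `pair_t_score` is hypothetical.

```python
import math
from statistics import mean, variance


def pooled_variance(a, b):
    """Pooled within-class variance of two samples (each of size >= 2)."""
    na, nb = len(a), len(b)
    return ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)


def pair_t_score(x1, x2, y):
    """Score a gene pair by a t-statistic on its DLD projection.

    x1, x2: expression values of the two genes across all samples.
    y: class labels, 0 or 1, one per sample.
    Assumes non-zero pooled variances; no guard is included in this sketch.
    """
    idx0 = [i for i, lab in enumerate(y) if lab == 0]
    idx1 = [i for i, lab in enumerate(y) if lab == 1]

    # DLD axis (assumed form): w_j = (mean0_j - mean1_j) / pooled_var_j
    w = []
    for x in (x1, x2):
        a = [x[i] for i in idx0]
        b = [x[i] for i in idx1]
        w.append((mean(a) - mean(b)) / pooled_variance(a, b))

    # Project every sample onto the DLD axis using only the two genes.
    z = [w[0] * u + w[1] * v for u, v in zip(x1, x2)]
    z0 = [z[i] for i in idx0]
    z1 = [z[i] for i in idx1]

    # Two-sample t-statistic (pooled variance) on the projected samples.
    sp2 = pooled_variance(z0, z1)
    se = math.sqrt(sp2 * (1 / len(z0) + 1 / len(z1)))
    return abs(mean(z0) - mean(z1)) / se
```

Scoring all pairs this way costs one call per pair, so the number of evaluations grows quadratically with the number of genes.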
II. LINEAR SEPARABILITY OF EXPRESSION DATA SETS
Recently, [1] proposed a geometric notion of linear
separation by gene pairs, in the context of gene expression
data sets, where samples belong to one of two distinct
classes, termed red and blue classes. The authors then
introduced a novel, highly efficient algorithm for finding all
gene pairs that induce a linear separation of the two-class
samples. Let $m = m_1 + m_2$ be the number of samples, out of