From Cancer Gene Expression Data to Simple Vital Rules
Rattikorn Hewett
Dept of Computer Science
Texas Tech University
rattikorn.hewett@ttu.edu
Ali Goksu
Dept of Computer Science
Texas Tech University
ali.goksu@ttu.edu
Soma Datta
Dept of Computer Science
Texas Tech University
soma.datta@ttu.edu
Abstract
Microarray gene expression profiling technology
generates huge, high-dimensional data sets. Finding analysis
techniques that can cope with such data characteristics is
crucial in Bioinformatics. This paper proposes a
variation of an ensemble learning approach combined
with a clustering technique to extract “simple” and yet
“vital” rules from genomic data. The paper describes the
approach and evaluates it on cancer gene expression data
sets. We report experimental results including
comparisons with other results obtained from a similar
ensemble learning approach as well as some
sophisticated techniques such as support vector machines.
1. Introduction
In cancer research, gene expression data, generated by
DNA microarrays, has been used to explore the
biological properties of tumors and to associate
expression patterns with clinical outcomes for patients in
various stages and different types of diseases [6, 7, 10, 12,
16]. This information can be useful to predict clinical and
pathological features relevant to treatment. DNA
microarrays generate high-dimensional data. Furthermore,
because of the complexity and heterogeneity of cancerous
tumors, there is an increasing emphasis on comprehensive
analysis of integrated data sets, including histological,
clinical and pathological characteristics of tumor
formation and growth [16]. As a result, the already huge
number of dimensions dramatically increases. Gene
analysis techniques that can cope with high-dimensional
data are, therefore, critically important.
Data analysis of complex data sets can be approached
with machine learning. Unlike statistical approaches,
machine learning does not require hypothesis formation
prior to analysis. Many sophisticated techniques, such as
non-linear neural networks and support vector machines,
can produce accurate models [3, 14]. However, these
models tend to be complex and difficult to interpret,
limiting insights into the results.
Other data mining techniques produce models that
are easier to interpret. These include variations of association
rule mining [1] and decision tree learning [13]. The latter
is one of the most prominent machine learning techniques
for classification and has been widely used to produce
results in terms of rules (a set of conjunctive conditions
on relevant features associated with a predictive term).
However, when these rule conditions contain a large
number of features (or attributes) they can be relatively
hard to understand.
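To make the notion of a rule concrete, the following is a minimal sketch of a rule as a conjunction of threshold conditions on features, paired with a predicted class. The class names `Condition` and `Rule`, the gene names, and the thresholds are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Condition:
    """One test on a single feature, e.g. gene_A > 2.5 (hypothetical)."""
    feature: str
    op: str          # either "<=" or ">"
    threshold: float

    def holds(self, sample: Dict[str, float]) -> bool:
        value = sample[self.feature]
        return value <= self.threshold if self.op == "<=" else value > self.threshold


@dataclass
class Rule:
    """A conjunction of conditions associated with a predicted class."""
    conditions: List[Condition]
    label: str

    def matches(self, sample: Dict[str, float]) -> bool:
        # The rule fires only if every condition holds (logical AND).
        return all(c.holds(sample) for c in self.conditions)


# Illustrative rule: IF gene_A > 2.5 AND gene_B <= 0.8 THEN tumor
rule = Rule(
    conditions=[Condition("gene_A", ">", 2.5), Condition("gene_B", "<=", 0.8)],
    label="tumor",
)
```

A rule with two conditions, as here, is easy to read; the same structure with dozens of conditions illustrates why long rules become hard to interpret.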
In dealing with a large number of features, several
techniques exist for attribute selection, which rank
attributes by various statistics (e.g., gain ratio,
entropy, chi-square [10, 13, 15]). However, empirical
evidence has shown that rules with high discriminant
power may also include low-ranked features [10].
Therefore, using an attribute selection technique that
relies only on top-ranked attributes may miss an
opportunity to find a useful rule. Thus, an alternative
approach to attribute selection in preprocessing and a
learning technique that allows opportunities for rule
abstraction with low-ranked attributes should be explored.
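As an illustration of rank-based attribute selection, the sketch below ranks discretized features by information gain, one of the entropy-based statistics mentioned above. The function names, the toy data, and the "hi"/"lo" discretization are assumptions made for this example and do not reproduce the paper's procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(data, labels):
    """Return feature names sorted from highest to lowest information gain."""
    gains = {name: information_gain(col, labels) for name, col in data.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data (hypothetical): three discretized genes, four samples.
labels = ["tumor", "tumor", "normal", "normal"]
data = {
    "g1": ["hi", "hi", "lo", "lo"],   # separates the classes perfectly
    "g2": ["hi", "lo", "hi", "lo"],   # carries no information alone
    "g3": ["hi", "hi", "hi", "lo"],   # partially informative
}
ranking = rank_features(data, labels)   # ['g1', 'g3', 'g2']
```

Keeping only the top of such a ranking would discard `g2` here; the concern raised above is that a feature ranked low on its own may still appear in a useful rule together with other features.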
One remedy for the above issue is ensemble
learning (e.g., Bagging [2] and Boosting [5]),
which has been applied successfully to improve the accuracy
of learning algorithms. In ensemble learning, we use a
learning algorithm to construct a committee of predictive
models (or classifiers) and obtain a prediction by
aggregating the resulting predictions from each of the
models constructed. In Bagging and Boosting, each model
is constructed from bootstrapped or reweighted (pseudo)
training data, respectively. Thus, each model is
constructed from a different training set.
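The committee idea can be sketched as follows, using Bagging with majority voting. This is a minimal illustration under stated assumptions: the base learner is a one-level decision stump standing in for a full tree learner, and the function names, toy data, and committee size are hypothetical. It shows generic Bagging, not the rank-based variation this paper proposes.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) training examples uniformly with replacement."""
    n = len(rows)
    idx = [rng.randrange(n) for _ in range(n)]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def train_stump(rows, labels):
    """Fit a one-level tree: the (feature, threshold) split with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label


def bagging(rows, labels, n_models, seed=0):
    """Build a committee, one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(*bootstrap_sample(rows, labels, rng)) for _ in range(n_models)]


def predict(committee, row):
    """Aggregate the committee's predictions by majority vote."""
    votes = Counter(model(row) for model in committee)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): one expression value per sample.
rows = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
committee = bagging(rows, labels, n_models=11, seed=0)
```

An odd committee size is used so that a two-class vote cannot tie. Boosting differs in that later models are trained on data reweighted toward the examples earlier models misclassified, and the votes are weighted rather than counted equally.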
This paper proposes a variation of an ensemble
learning approach for extracting useful rules from high-
dimensional genomic data. Our approach is inspired by
the approach introduced by Li et al. [10]. The
difference is that we use rank-based projections of a
training data set to construct a tree committee instead of
forcing a feature with a certain rank to be at the root of a tree
during committee construction, as in [10]. Although
the proposed technique is general in that it can be applied
329 1-4244-0359-6/06/$20.00 ©2006 IEEE.