From Cancer Gene Expression Data to Simple Vital Rules

Rattikorn Hewett, Dept of Computer Science, Texas Tech University, rattikorn.hewett@ttu.edu
Ali Goksu, Dept of Computer Science, Texas Tech University, ali.goksu@ttu.edu
Soma Datta, Dept of Computer Science, Texas Tech University, soma.datta@ttu.edu

Abstract

Microarray gene expression profiling technology generates huge volumes of high-dimensional data. Finding analysis techniques that can cope with such data characteristics is crucial in bioinformatics. This paper proposes a variation of an ensemble learning approach combined with a clustering technique to extract "simple" and yet "vital" rules from genomic data. The paper describes the approach and evaluates it on cancer gene expression data sets. We report experimental results, including comparisons with results obtained from a similar ensemble learning approach and from more sophisticated techniques such as support vector machines.

1. Introduction

In cancer research, gene expression data, generated by DNA microarrays, has been used to explore the biological properties of tumors and to associate expression patterns with clinical outcomes for patients at various stages and with different types of disease [6, 7, 10, 12, 16]. This information can be useful for predicting clinical and pathological features relevant to treatment. DNA microarrays generate high-dimensional data. Furthermore, because of the complexity and heterogeneity of cancerous tumors, there is an increasing emphasis on comprehensive analysis of integrated data sets, including histological, clinical and pathological characteristics of tumor formation and growth [16]. As a result, the already huge number of dimensions increases dramatically. Gene analysis techniques that can cope with high-dimensional data are, therefore, critically important.

Data analysis of complex data sets can be approached with machine learning.
Unlike statistical approaches, machine learning does not require hypothesis formation prior to analysis. Many sophisticated techniques, such as non-linear neural networks and support vector machines, can produce accurate models [3, 14]. However, these models tend to be complex and difficult to interpret, which limits insight into the results. Other data mining techniques produce models that are easier to interpret. These include variations of association rule mining [1] and decision tree learning [13]. The latter is one of the most prominent machine learning techniques for classification and has been widely used to produce results in the form of rules (a set of conjunctive conditions on relevant features associated with a predictive term). However, when these rule conditions contain a large number of features (or attributes), they can be relatively hard to understand.

To deal with a large number of features, several attribute selection techniques exist, which rank attributes using various statistics (e.g., gain ratio, entropy, chi-square [10, 13, 15]). However, empirical evidence has shown that rules with high discriminant power may also include low-ranked features [10]. Therefore, an attribute selection technique that relies only on top-ranked attributes may miss an opportunity to find a useful rule. Thus, an alternative approach to attribute selection in preprocessing, together with a learning technique that allows opportunities for rule abstraction with low-ranked attributes, should be explored.

One remedy to the above issue is an ensemble learning technique (e.g., Bagging [2] and Boosting [5]), which has been applied successfully to improve the accuracy of learning algorithms. In ensemble learning, we use a learning algorithm to construct a committee of predictive models (or classifiers) and obtain a prediction by aggregating the predictions of the individual models.
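The committee-and-aggregation idea just described can be sketched in a few lines. The following is a minimal illustration only, not the method proposed in this paper: it assumes bagging with bootstrap resampling, one-feature decision stumps as the base learner, and majority voting as the aggregation rule. The function names (`train_stump`, `bagging_committee`, `predict`) and the two-gene toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a depth-1 decision tree: one feature, one threshold."""
    best = None  # (errors, feature index, threshold, label if <=, label if >)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            lo = Counter(left).most_common(1)[0][0]
            hi = Counter(right).most_common(1)[0][0] if right else lo
            errors = sum(y != lo for y in left) + sum(y != hi for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, lo, hi)
    _, f, t, lo, hi = best
    return lambda row: lo if row[f] <= t else hi


def bagging_committee(rows, labels, n_models=11, seed=0):
    """Train each committee member on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    committee = []
    for _ in range(n_models):
        # Bootstrap: draw n training examples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        committee.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx])
        )
    return committee


def predict(committee, row):
    """Aggregate the members' predictions by majority vote."""
    votes = Counter(model(row) for model in committee)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (expression of gene A, expression of gene B).
# Gene A separates the classes; gene B is uninformative noise.
rows = [(0.1, 1.2), (0.2, 0.7), (0.3, 1.9), (0.4, 0.4), (0.5, 1.5),
        (2.0, 1.1), (2.1, 0.6), (2.2, 1.8), (2.3, 0.5), (2.4, 1.4)]
labels = ["normal"] * 5 + ["tumor"] * 5

committee = bagging_committee(rows, labels)
print(predict(committee, (0.05, 1.0)))
print(predict(committee, (3.0, 1.0)))
```

Each stump here is itself a single readable rule (one condition on one gene), which is the kind of interpretability that motivates tree-based committees over less transparent models.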
In Bagging and Boosting, each model is constructed from bootstrapped or pseudo training data, respectively. Thus, each model is constructed from a different training set.

1-4244-0359-6/06/$20.00 ©2006 IEEE.

This paper proposes a variation of an ensemble learning approach for extracting useful rules from high-dimensional genomic data. Our approach is inspired by the approach introduced by Li et al. [10]. The difference is that we use rank-based projections of a training data set to construct a tree committee, instead of forcing a feature of a certain rank to be the root of a tree while building the tree committee, as in [10]. Although the proposed technique is general in that it can be applied