Evaluating Two Approaches to Extracting Gene Regulatory Networks: Bayesian Networks and Association Rule Mining

Zan Huang 1, Jiexun Li 1, Jie Xu 1, Ritu Pandey 2, Hsinchun Chen 1

1 Artificial Intelligence Lab, Department of Management Information Systems, The University of Arizona, Tucson, Arizona 85721, USA
{zhuang, jiexun, jxu, hchen}@eller.arizona.edu

2 Arizona Cancer Center, University of Arizona, Tucson, Arizona 85724, USA
ritu@email.arizona.edu

ABSTRACT

Advances in microarray technologies have enabled simultaneous measurement of expression levels of thousands of genes, creating new opportunities and challenges for gene expression data analysis. Several recent studies have proposed to extract gene regulatory relations from microarray data with a wide range of techniques. However, because of the dimensionality problem in microarray data, most existing studies have included only a small number of genes. There is also a lack of evaluation of the extracted networks. Both problems have limited the practical value of the gene regulatory network analysis. In this paper, we present two algorithms for large-scale gene regulatory network analysis: an information-theory-based Bayesian network algorithm and a modified association rule mining algorithm. We also present two types of evaluations of the resulting networks: a simulation-based evaluation and an empirical evaluation. Six simulated gene expression datasets based on three pre-defined regulatory network models and two real datasets (a Saccharomyces cerevisiae dataset and a Homo sapiens dataset) were used in the evaluation study. The simulation-based evaluation results indicated that the two techniques could extract 30%-60% correct relations when relation direction was not considered. The empirical evaluation showed that the extracted networks generally failed to identify regulatory relations reported in the literature.
However, more than 50% of the extracted relations reflected gene co-occurrence patterns in the literature, and a small set of relations appeared to domain scientists to be potentially correct and interesting.

Keywords
Gene regulatory network, Bayesian network, Association rule.

1. INTRODUCTION

Recent advances in microarray technologies have made possible large-scale gene expression analyses based on simultaneous measurements of thousands of genes. Many data mining techniques (e.g., clustering and classification) have been employed to uncover the biological functions of genes from microarray data. Recently, a reverse engineering approach has been used to extract gene regulatory networks in order to reveal the structure of the transcriptional gene regulation processes. The general goal of gene regulatory network analysis is to extract pronounced regulatory relations (e.g., activation and inhibition) between genes by examining the global gene expression patterns. The resulting network of regulatory processes may help researchers form new hypotheses about the behavior of biological systems and assist with the design of further experiments.

Many studies have proposed network extraction approaches. However, a common problem in these studies is that they include only a relatively small number of genes. This is mainly because of the inherent dimensionality problem in microarray data, which usually contain too few samples relative to the large number of genes measured. For this type of analysis to capture the complexity of biological systems, scalable techniques need to be developed to extract regulatory networks that contain a large number of genes. Another problem in previous studies is the lack of empirical evaluation of the gene regulatory