Learning Causal Networks from Microarray Data

Nasir Ahsan† Michael Bain†* John Potter† Bruno Gaëta†‡ Mark Temple‡ Ian Dawes‡
† School of Computer Science and Engineering
‡ School of Biotechnology and Biomolecular Sciences
University of New South Wales, Sydney, Australia 2052
* Email: mike@cse.unsw.edu.au

Abstract

We report on a new approach to modelling and identifying dependencies within a gene regulatory cycle. In particular, we aim to learn the structure of a causal network from gene expression microarray data. We model causality in two ways: by using conditional independence assumptions to model the independence of different causes on a common effect; and by relying on time delays between cause and effect. Networks therefore incorporate both probabilistic and temporal aspects of regulation. We are thus able to deal with cyclic dependencies amongst genes, which is not possible in standard Bayesian networks. However, our model is kept deliberately simple to make it amenable to learning from microarray data, which typically contains a small number of samples for a large number of genes. We have developed a learning algorithm for this model which was implemented and experimentally validated against simulated data and on yeast cell cycle microarray time series data sets.

1 Introduction

With the complete genomic sequences of increasing numbers of organisms available, there are unprecedented opportunities for the biological analysis of complex cellular processes. In this new era of cell biology there have been major advances in the ability to collect data on a genome-wide scale by the use of high-throughput technology such as gene expression microarrays. However, this data requires analysis beyond the level of individual genes; we need to investigate networks of genes acting within highly regulated systems.
Owing to the complexity of the systems to be studied, a wide range of methods to model networks and learn them from data have been developed (Endy & Brent 2001, de Jong 2002). However, in our work we are not aiming to model the full dynamical systems of the cell, but rather to learn key causal features from data.

In this paper we investigate the use of causal networks to model relations between genes as measured in microarray data. More specifically, we explore learning the structure of a genetic regulatory network in this setting. Our approach to causal networks is based on a form of dynamic Bayesian network in which causality is represented in two ways: by relying on conditional independence assumptions to model the independence of different causes on a common effect; and by relying on time delays between cause and effect. By incorporating time delays into our model, we are able to deal with cyclic dependencies amongst genes. From a high-level perspective, we address the following problems:

1. How can we formalize the model of a causal network which combines probabilistic and temporal aspects, in particular, conditional independence and temporal precedence?

2. How can we learn such a model from observations of the variables over time in a robust and computationally efficient manner?

The paper is organised as follows. In Section 2 we introduce our representation for causal models, and in Section 3 we develop a learning algorithm for such models. In Section 4 we present results from experimental application of the approach to cell cycle time series microarray data.

Copyright © 2006, Australian Computer Society, Inc. This paper appeared at The 2006 Workshop on Intelligent Systems for Bioinformatics (WISB2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 73. Mikael Bodén and Timothy L. Bailey, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
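The role of time delays in breaking cyclic dependencies can be illustrated with a minimal sketch (the gene names and delays below are hypothetical, chosen only for illustration): when every edge carries a positive time delay, unrolling the network over discrete time steps yields a directed acyclic graph, even if the underlying gene-level graph contains feedback loops.

```python
# Illustrative sketch (hypothetical genes and delays): a cyclic dependency
# graph among genes becomes acyclic once each cause-effect edge carries a
# time delay and the network is unrolled over discrete time steps.

# Cyclic regulatory relations: (cause, effect, delay in time steps).
# Note geneA -> geneB -> geneC -> geneA forms a feedback loop.
delayed_edges = [
    ("geneA", "geneB", 1),
    ("geneB", "geneC", 1),
    ("geneC", "geneA", 2),
]

def unroll(edges, n_steps):
    """Unroll time-delayed edges into a graph over (gene, time) nodes."""
    unrolled = []
    for cause, effect, delay in edges:
        for t in range(n_steps - delay):
            unrolled.append(((cause, t), (effect, t + delay)))
    return unrolled

def is_acyclic(edges):
    """Every unrolled edge points strictly forward in time, so no cycles."""
    return all(src[1] < dst[1] for src, dst in edges)

dag = unroll(delayed_edges, n_steps=5)
print(is_acyclic(dag))  # the unrolled network contains no cycles
```

Because each edge advances time by at least one step, a topological order of the unrolled nodes is given by the time index itself, which is what makes estimation tractable where a standard (atemporal) Bayesian network would reject the cyclic structure outright.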
2 A new probabilistic model

We introduce a probabilistic model that satisfies certain assumptions and addresses the problems of complexity discussed above. We then extend this probabilistic model to include time, and develop methods for estimating our probabilistic model from time series data, specifically microarray data. Note that in this paper our models contain only discrete random variables.

2.1 The F-model

In the Bayesian network model an effect is conditionally independent of its ancestors given its immediate causes. This allows one to model an effect θ with immediate causes β_1, ..., β_n as shown on the left in Figure 1. Unfortunately, modelling an effect in this way requires the estimation of at least 2^n parameters, where n is the number of causes. Clearly the number of probabilities to be estimated increases exponentially in n. However, if certain combinations of causes are known a priori to be unlikely, they may be safely ignored in the interests of simplifying the problem.

This suggests a way to reduce the number of distinct events that need to be modelled, effectively compressing the space of events for which probabilities need to be estimated. To do this we will use a function F, called a compression function, to compress the joint probability distribution. In the standard Bayesian network framework the conditional probability for the dependence of an effect θ on its causes β_1, ..., β_n is P(θ | β_1, ..., β_n). Using the compression function this becomes P(θ | F(β_1, ..., β_n)), as shown on the right in Figure 1. If we assume that P(θ | β_1, ..., β_n) ≡ P(θ | F(β_1, ..., β_n)), for some arbitrary F, a joint probability distribution may be decomposed using F-based conditional independence factors.

Definition 1 (F-model) Let M denote a probability model over the random variable space Θ =
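The parameter saving achieved by a compression function can be made concrete with a minimal sketch (the particular choice of F below, counting the number of active causes, is our own illustrative assumption, not the paper's definition): a full conditional probability table over n binary causes needs one entry per joint parent state, while conditioning on F(β_1, ..., β_n) needs only one entry per distinct value of F.

```python
from itertools import product

# Hypothetical sketch: compare the size of a full conditional probability
# table P(theta | beta_1..beta_n) with one compressed through a function F.
# Here F = sum counts the number of active causes, collapsing the 2^n
# binary parent configurations into just n + 1 equivalence classes.

def full_cpt_rows(n_causes):
    """One parameter per joint state of the binary causes: 2^n rows."""
    return len(list(product([0, 1], repeat=n_causes)))

def compressed_rows(n_causes, F):
    """One parameter per distinct value F takes over the parent states."""
    values = {F(state) for state in product([0, 1], repeat=n_causes)}
    return len(values)

F = sum  # illustrative compression function: number of active causes
for n in (3, 5, 10):
    print(n, full_cpt_rows(n), compressed_rows(n, F))
```

For n = 10 causes this reduces 1024 parameters to 11, which matters precisely in the microarray setting described above, where the number of samples is far smaller than the number of parent configurations.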