Scaling-Up Bayesian Network Learning to Thousands of Variables Using Local Learning Techniques

Ioannis Tsamardinos, Ph.D., Constantin F. Aliferis, M.D., Ph.D., Alexander Statnikov, M.S., Laura E. Brown, M.S.
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN

ABSTRACT

State-of-the-art Bayesian Network learning algorithms do not scale to more than a few hundred variables; thus, they fall far short of addressing the challenges posed by the large datasets in biomedical informatics (e.g., gene expression, proteomics, or text-categorization data). In this paper, we present a BN learning algorithm, called the Max-Min Bayesian Network (MMBN) learning algorithm, that can induce networks with tens of thousands of variables or, alternatively, can selectively reconstruct regions of interest if time does not permit full reconstruction. MMBN is based on a local algorithm that reconstructs targeted areas of the network, and on putting these pieces together. On a small dataset MMBN outperforms other state-of-the-art methods. Subsequently, its scalability is demonstrated by fully reconstructing from data a Bayesian Network with 10,000 variables using ordinary PC hardware. The novel algorithm pushes the envelope of Bayesian Network learning (an NP-complete problem) by about two orders of magnitude.

1. Introduction

Bayesian Networks (BNs) are a formalism that has proved itself a useful and important tool in medicine for building decision support systems [2] and in bioinformatics for discovering gene expression pathways [4]. Automatically learning BNs from observational data has been an area of intense research for more than a decade, yielding practical algorithms and tools. The BN representation and learning algorithms naturally lend themselves to causal modeling and causal discovery [5].
Despite the great advances in BN learning techniques, they have not yet proved themselves up to the challenge posed by a number of current domains, such as gene expression, proteomics, and text categorization. Current techniques scale only to a few hundred variables in the best case. For example, the publicly available versions of the PC [10] and the TPDA (also called PowerConstructor) [6] algorithms accept datasets with only 100 and 255 variables, respectively, indicating the expectations of the inventors regarding their scalability. In this paper, we present a BN learning algorithm called Max-Min Bayesian Network (MMBN) that is able to scale up to tens of thousands of variables, thus pushing the envelope of BN learning by about two orders of magnitude. When compared with a variety of state-of-the-art methods in the field on a small dataset, it exhibits improved quality of output. In addition, the performance of the algorithm remains excellent when reconstructing from data a BN with 10,000 variables using 1,000 training instances.

MMBN has an additional important advantage over previous methods: it is an anytime algorithm, in the sense that one can stop it at any time and recover only an area of interest around a target variable T instead of the full BN. The longer the algorithm is allowed to run, the larger the reconstructed area around T will be. This property allows the algorithm to learn at least parts of extremely large networks; this is empirically demonstrated with a proof-of-concept experiment.

2. The Max-Min Bayesian Network (MMBN) Algorithm

Bayesian Networks (BNs) [10] are mathematical objects that compactly represent a joint probability distribution J among a set of random variables Φ (also called nodes) using a directed acyclic graph G. As a first step, MMBN discovers the edges of the BN, and in a second step it orients them. However, for the rest of the paper we focus only on the first step of edge discovery.
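To make the representation concrete, the following is a minimal sketch (not from the paper; the two-node network and its probabilities are illustrative) of how a BN's DAG factorizes the joint distribution J into a product of local conditional probabilities, one per node given its parents:

```python
# Hypothetical two-node network A -> B over binary variables.
# The DAG G entails the factorization P(A, B) = P(A) * P(B | A).

# P(A)
p_a = {0: 0.7, 1: 0.3}

# P(B | A): outer key is the parent value a, inner key is b.
p_b_given_a = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

def joint(a, b):
    """P(A=a, B=b) via the Markov factorization P(A) * P(B | A)."""
    return p_a[a] * p_b_given_a[a][b]

# The factorization defines a proper distribution: entries sum to 1.
total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))
```

The compactness advantage is that each node stores only a table conditioned on its parents, rather than one entry per joint configuration of all variables.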
The orientation phase of the PC algorithm, or a constrained hill-climbing search-and-score in the fashion of the Sparse Candidate algorithm [3], can then be used for edge orientation. Edge identification is important by itself, since an edge between variables X and Y, under certain conditions, corresponds to a direct causal relation between the two variables, i.e., X directly causing Y or vice versa [5, 9, 10]. Thus, edges may be used to generate accurate hypotheses for causal discovery.

Since there may be many BNs that can capture the joint distribution of the data, the question is which one MMBN discovers. In [10] it is proved that all BNs that are faithful to the same joint distribution have the same set of edges. This unique set of edges is the one discovered by MMBN. A BN is faithful to a joint distribution if all and only the independencies in the distribution are entailed by the Markov Condition applied to its graph.
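Constraint-based edge discovery of the kind described above decides whether an edge X-Y belongs to the skeleton by testing conditional independencies of the form X ⊥ Y | Z. The sketch below, which is illustrative and not the paper's MMBN implementation, computes a G² statistic for discrete variables and a conditional version obtained by summing the statistic across the strata of Z (a common approximation; all function names are ours):

```python
import math
from collections import Counter

def g2_statistic(pairs):
    """G^2 statistic for independence of two discrete variables,
    given a list of observed (x, y) samples."""
    n = len(pairs)
    cx, cy, cxy = Counter(), Counter(), Counter(pairs)
    for x, y in pairs:
        cx[x] += 1
        cy[y] += 1
    g2 = 0.0
    for (x, y), observed in cxy.items():
        expected = cx[x] * cy[y] / n  # expected count under independence
        g2 += 2.0 * observed * math.log(observed / expected)
    return g2

def cond_g2(data, xi, yi, zi):
    """Approximate test of X independent of Y given Z: stratify the rows
    by the value of Z and sum the per-stratum G^2 statistics."""
    strata = {}
    for row in data:
        strata.setdefault(row[zi], []).append((row[xi], row[yi]))
    return sum(g2_statistic(p) for p in strata.values() if len(p) > 1)
```

Under faithfulness, a large G² (dependence) at every conditioning set supports keeping the edge X-Y, while a small value for some Z licenses its removal; a local algorithm restricts the candidate conditioning sets to a neighborhood of the target variable, which is what makes scaling to thousands of variables feasible.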