Learning an L1-regularized Gaussian Bayesian network in the space of equivalence classes

Diego Vidaurre (dvidaurre@gmail.com), Concha Bielza (mcbielza@fi.upm.es), Pedro Larrañaga (pedro.larranaga@fi.upm.es)
Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, Spain

Keywords: Gaussian networks, model selection, regularization, Lasso, learning from data, equivalence classes

Abstract

Learning the structure of a graphical model from data is a common task in a wide range of practical applications. In this paper we focus on Gaussian Bayesian networks (GBNs), that is, on continuous data and directed graphs. We propose to work in an equivalence-class search space which, combined with regularization techniques to guide the structure search, allows us to learn a sparse network close to the one that generated the data.

1. Introduction

In a GBN, we have a joint Gaussian probability distribution on a finite set of variables. Each variable X_i has its own univariate normal distribution given its parents. The relationships between the variables are always directed. The joint probability distribution is the product of the conditional distributions of each X_i.

There are two basic approaches to structure learning: independence tests and score+search methods. A popular score+search procedure is based on greedy methods, for which a score function and a neighbourhood must be defined. To represent solutions and move through the search space, we typically choose between directed acyclic graphs (DAGs) and partially directed acyclic graphs (PDAGs). An equivalence class, modelled by a PDAG, is the set of graphs that encode a unique probability distribution with the same conditional independences. It is often the preferred representation and the only one capable of satisfying the inclusion boundary (IB) requirement (Chickering, 2002).
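The factorization described above can be sketched in a few lines of code. This is a minimal illustration, not the paper's method: the three-node network, its regression coefficients, and variances are invented for the example, and each node's conditional is a univariate normal whose mean is a linear function of its parents.

```python
import math

# Hypothetical 3-node GBN with arcs X1 -> X2 -> X3.
# Each node stores (intercept, {parent: coefficient}, variance);
# all numbers here are made up for illustration.
gbn = {
    "X1": (0.0, {}, 1.0),
    "X2": (1.0, {"X1": 2.0}, 0.5),
    "X3": (0.0, {"X2": -1.0}, 2.0),
}

def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal N(mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def joint_log_density(gbn, assignment):
    """Joint log density = sum of each node's conditional log density
    given the values of its parents (the GBN product factorization)."""
    total = 0.0
    for node, (intercept, coefs, var) in gbn.items():
        mean = intercept + sum(c * assignment[p] for p, c in coefs.items())
        total += log_normal_pdf(assignment[node], mean, var)
    return total

print(joint_log_density(gbn, {"X1": 0.0, "X2": 1.0, "X3": -1.0}))
```

Because the score of each node depends only on that node and its parents, scores of this form decompose locally, which is what makes greedy arc-by-arc search efficient.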
Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Three main problems arise when working in the DAG space rather than with equivalence classes. First, some operators defined to move between DAGs may operate between graphs in the same equivalence class, which wastes time. Second, a random factor is introduced: the connectivity of the current state with other states depends on the specific DAG chosen to represent the equivalence class, so the best transition might not be performed if equivalence classes are ignored. Given that all members of a class score the same value, there is no reason other than randomness to prefer a specific member. The third problem concerns the a priori probability that the final DAG belongs to a specific equivalence class. If all DAGs in the same class are interpreted as different models, the a priori probability of an equivalence class being the final output is the sum of the a priori probabilities of the DAGs inside it. This has consequences that are difficult to predict.

A model is inclusion optimal with regard to a distribution p(X_1, ..., X_n) if it includes p(X_1, ..., X_n) and no model strictly included in it also includes p(X_1, ..., X_n). It is well known that a greedy algorithm that respects the IB neighbourhood and uses a locally consistent scoring criterion is asymptotically optimal under the faithfulness assumption, and inclusion optimal under the weaker composition property assumption (Chickering & Meek, 2002). This is the case for the greedy equivalence search (GES) algorithm developed by Chickering (2002).
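GES itself moves over PDAG equivalence classes; as a simpler sketch of the underlying score+search idea, the following greedy hill-climber adds one arc at a time in DAG space. The node names and the toy score are placeholders (a real implementation would use a locally decomposable score such as BIC), and the acyclicity check is a plain depth-first reachability test.

```python
import itertools

def children_adj(parents):
    """Build a parent -> children adjacency map from a parents dict."""
    adj = {v: [] for v in parents}
    for child, ps in parents.items():
        for p in ps:
            adj[p].append(child)
    return adj

def has_path(adj, src, dst):
    """Depth-first check for a directed path from src to dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False

def greedy_add_edges(nodes, local_score):
    """Greedy hill-climbing over single-arc additions.
    `local_score(child, parent_set)` stands in for a locally
    decomposable score such as BIC; stop when no addition improves it."""
    parents = {v: set() for v in nodes}
    while True:
        best_gain, best_arc = 0.0, None
        for u, v in itertools.permutations(nodes, 2):
            # Skip existing arcs and arcs u -> v that would create a
            # cycle (i.e. a directed path v -> ... -> u already exists).
            if u in parents[v] or has_path(children_adj(parents), v, u):
                continue
            gain = local_score(v, parents[v] | {u}) - local_score(v, parents[v])
            if gain > best_gain:
                best_gain, best_arc = gain, (u, v)
        if best_arc is None:
            return parents
        u, v = best_arc
        parents[v].add(u)

# Toy score rewarding the (made-up) true arc A -> B and penalizing size.
true_parents = {"A": set(), "B": {"A"}}
toy_score = lambda v, ps: len(ps & true_parents[v]) - 0.5 * len(ps)
print(greedy_add_edges(["A", "B"], toy_score))
```

Note that this sketch exhibits exactly the problems the text describes: a move may land in another DAG of the same equivalence class, and which neighbours are reachable depends on the arbitrary DAG currently held.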
As a generalization of GES, Nielsen et al. (2003) implement the KES algorithm (k-greedy equivalence search) for Bayesian network learning, where a stochastic factor is introduced so that multiple runs can be made to extract common patterns from the solutions, that is, to include in the final model only those arcs that appear in most of the runs. For simplicity, we will not work with several
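The "common patterns" aggregation over runs can be sketched as follows. This is an illustration, not KES itself: the arc sets, the 0.5 inclusion threshold, and the representation of arcs as (parent, child) tuples are all assumptions made for the example.

```python
from collections import Counter

def consensus_arcs(runs, threshold=0.5):
    """Keep only the arcs that appear in at least `threshold`
    fraction of the runs; `threshold` defaults to a majority vote,
    which is an assumption, not a value from the paper."""
    counts = Counter(arc for run in runs for arc in set(run))
    return {arc for arc, c in counts.items() if c / len(runs) >= threshold}

# Three hypothetical runs, each producing a set of (parent, child) arcs.
runs = [
    {("X1", "X2"), ("X2", "X3")},
    {("X1", "X2"), ("X1", "X3")},
    {("X1", "X2"), ("X2", "X3")},
]
print(consensus_arcs(runs))  # arcs present in at least half the runs
```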