Common Roots from High Stochastic Dependence
Bastian Steudel and Nihat Ay
MPI for Mathematics in the Sciences, Leipzig, Germany

Setting
We use Bayesian nets as a formalization of the probabilistic and causal relations of a system and present a result that describes how information-theoretic means can contribute to the causal inference process.

Task of Causal Inference
Starting from an observation of a subsystem, given in terms of a probability distribution of random variables, determine the class of Bayesian nets that are consistent with the observation.

[Figure: observation yields a distribution p(x_1,...,x_4) on the subsystem {X_1,...,X_4}; inference seeks a Bayesian net B with hidden nodes u such that p(x_1,...,x_4) = \sum_u p_B(x_1,...,x_4,u).]

Graphical Models and Information Theory
For a given directed acyclic graph G whose nodes are discrete random variables X_1,...,X_n, denote by P(G) the family of joint probability distributions which factor according to G. Then, for an arbitrary distribution p of the X_i,

  I_p(X_1,...,X_n) = D(p \,\|\, P(G)) + \sum_{i=1}^{n} I_p(X_i, \mathrm{parents}(X_i)),   (*)

where
• D(p \| P(G)) := \inf_{q \in P(G)} D(p \| q) is the distance of p from the family of distributions P(G), measured in terms of the Kullback-Leibler divergence D(p \| q) = \sum p \log(p/q);
• I_p is the (generalized) mutual information I_p(X_1,...,X_n) = D(p \,\|\, p(x_1) \otimes \cdots \otimes p(x_n)).

General Question: Relation (*) only holds for distributions p defined on all nodes of the graph, i.e. if the whole system has been observed. What can be said in the case of incomplete knowledge, that is, if there are unobserved variables?

Inference of Common Roots
Common roots of two variables: If X_1 and X_2 are stochastically dependent, then in every causal model there is a path from X_2 to X_1, or a path from X_1 to X_2, or a node with paths to both X_1 and X_2. In short: stochastic dependence of two variables implies the existence of a common ancestor.

Definition. Let X = {X_1,...,X_k} be nodes of a Bayesian net B. A node which is an ancestor of at least c nodes of X is called a common root of order c.

Question: What is a sufficient condition for the existence of common roots of more than two variables?

Example: Two causal models of a subsystem {X_1, X_2, X_3} [figure: two alternative DAGs over X_1, X_2, X_3 with hidden nodes]:
• no (conditional) independencies on the subsystem are enforced in either model,
• the model at the top has no common root of all three variables.
⟹ Common roots cannot be inferred from stochastic independencies alone.

Theorem (Inference of Common Roots). Consider n random variables X_1,...,X_n taking values in a finite set and a number c with 2 ≤ c ≤ n. If the mutual information satisfies

  I(X_1,...,X_n) > \Bigl(1 - \frac{1}{c-1}\Bigr) \sum_{i=1}^{n} H(X_i),

then there exist common roots of order c with positive entropy.

Remarks
• Reformulation: common roots of order c can be inferred if

    I_c := \frac{1}{c-1} \sum_{i=1}^{n} H(X_i) - H(X_1,...,X_n) > 0.

• The entropy of all common roots of order c is at least \frac{c-1}{n-c+1} I_c.
• Example (Synchronized States): [Figure: a single hidden node U with arrows to each of X_1,...,X_n.] Assume that (1) there are no causal interactions among the components of the observed subsystem, and (2) the mutual information of the subsystem is maximal and all variables have equal entropy. Then the hidden common cause U satisfies H(U) ≥ H(X_1) = ··· = H(X_n).
• Example (Maximal Interaction): Distributions of binary variables of the form p_a(x_1,...,x_n) \propto \exp(a\, x_1 \cdots x_n) (a ∈ R, x_i ∈ {-1, 1}) can be generated using only common roots of order two.
• The result also holds in the algorithmic causal setting introduced in [4] when Kolmogorov complexity is substituted for entropy.
• High mutual information is sufficient for common roots, but not necessary: consider n random variables with values in {-1, 1} and p_a(x_1,...,x_n) \propto \exp\bigl(a \sum_{i,j=1}^{n} x_i x_j\bigr) with a ∈ R.
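The reformulated criterion is straightforward to evaluate when the joint distribution of the observed variables is available. The following Python sketch is only an illustration and not part of the results above: it assumes the joint distribution is given as an n-dimensional NumPy array of probabilities, and the function and variable names are ad hoc. It computes the marginal entropies H(X_i) and the joint entropy, and returns the largest order c with I_c > 0 together with the corresponding entropy bound \frac{c-1}{n-c+1} I_c.

import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability array; zero entries are ignored."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def common_root_order(p_joint):
    """Return the largest c (2 <= c <= n) with
       I_c = (1/(c-1)) * sum_i H(X_i) - H(X_1,...,X_n) > 0,
    together with the entropy bound (c-1)/(n-c+1) * I_c,
    or (None, 0.0) if not even c = 2 can be inferred."""
    n = p_joint.ndim
    h_joint = entropy(p_joint)
    # marginal entropies H(X_i): sum out all axes except i
    h_marg = [entropy(p_joint.sum(axis=tuple(j for j in range(n) if j != i)))
              for i in range(n)]
    for c in range(n, 1, -1):          # try the largest order first
        i_c = sum(h_marg) / (c - 1) - h_joint
        if i_c > 0:
            return c, (c - 1) / (n - c + 1) * i_c
    return None, 0.0

# Synchronized state of three binary variables: p(0,0,0) = p(1,1,1) = 1/2.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(common_root_order(p))            # (3, 1.0): a common root of all three
                                       # variables with at least 1 bit of entropy

For this synchronized example the bound of 1 bit matches the entropy H(U) of the hidden common cause in the Synchronized States example above.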
Future Work
1. How far can one go in characterizing causal models using only entropy-like quantities?
2. Consider the decomposition of the mutual information into terms originating from the projections of p onto the l-interaction spaces:

     I(p) = \sum_{l=2}^{k} D(p^{(l)} \,\|\, p^{(l-1)}).

   Do causal interpretations for these interaction terms exist?
3. Derive heuristics for causal inference algorithms from information-theoretic results such as the one above.

References
[1] B. Steudel and N. Ay: Inferring Common Causes from High Multi-Information, submitted, 2008.
[2] N. Ay: A Refinement of the Common Cause Principle, SFI Working Paper, 2008.
[3] L. Campos: A Scoring Function for Learning Bayesian Networks Based on Mutual Information and Conditional Independence Tests, J. Mach. Learn. Res., Vol. 7, 2006.
[4] D. Janzing and B. Schoelkopf: Causal Inference Using the Algorithmic Markov Condition, arXiv:0804.3678, 2008.