Optimally Predictive Causal Inference

Susanne Still                                   sstill@hawaii.edu
Information and Computer Sciences
University of Hawaii at Manoa
Honolulu, HI 96822, USA

James P. Crutchfield                            chaos@cse.ucdavis.edu
Christopher J. Ellison                          cellison@cse.ucdavis.edu
Complexity Sciences Center and Physics Department
University of California at Davis
One Shields Avenue, Davis, CA 95616, USA

Editor: TBD

Abstract

Natural systems compute intrinsically and produce information. The organization of a stochastic dynamical system is reflected in the time series of observations made of the system and can be quantified by the excess entropy, or predictive information: the mutual information between past and future. This information can be used to build models of varying complexity that capture the causal structure of the underlying system. Here we study two distinct cases of causal inference, which we call optimal causal filtering and optimal causal estimation. Optimal causal filtering corresponds to the ideal case in which infinite data are available. We show that, in the limit in which a model-complexity constraint is relaxed, the filtering method finds the causal architecture of a stochastic dynamical system, known as the causal state partition. In that limit, it reconstructs exactly the system's hidden causal states. More generally, it finds a graded model-complexity hierarchy of approximations to the causal architecture. For nonideal cases with finite data, we show how the correct number of underlying causal states can be found by optimal causal estimation. A previously derived model-complexity control term allows us to correct for the effect of statistical fluctuations in probability estimates and thereby avoid over-fitting.

1. Introduction

Time series modeling has a long and important history in science and engineering. Advances in dynamical systems over the last half century led to new methods that attempt to account for the inherent nonlinearity of many natural information sources (Strogatz, 1994). As a result, it is now well known that nonlinear systems produce highly correlated time series that are not adequately modeled under the typical statistical assumptions of linearity, independence, and identical distribution. One consequence, exploited in novel state-space reconstruction methods (Packard et al., 1980), is that discovering the hidden structure of such processes is key to successful modeling and prediction (Kantz and Schreiber, 2006).

Following these lines, here we investigate the problem of learning predictive models of time series, with particular attention paid to discovering hidden variables. We do this by using the information bottleneck method (IB) (Tishby et al., 1999) together with a
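For concreteness, the two quantities that anchor this program can be written out explicitly. The notation below is a sketch in common conventions and not necessarily the paper's own: \(\overleftarrow{X}\) and \(\overrightarrow{X}\) denote the semi-infinite past and future of the observed process, and \(R\) denotes a model's internal states. The predictive information (excess entropy) is the past-future mutual information, and the information bottleneck of Tishby et al. (1999) compresses the past into states \(R\) while retaining predictive power:

    % Predictive information (excess entropy): mutual information
    % between the process's past and its future.
    \mathbf{E} \;=\; I\bigl[\,\overleftarrow{X};\,\overrightarrow{X}\,\bigr]

    % Information bottleneck objective: trade model complexity against
    % predictive power, with Lagrange multiplier \beta \ge 0.
    \min_{p(r \mid \overleftarrow{x})}\;
      I\bigl[\,\overleftarrow{X};\, R\,\bigr]
      \;-\; \beta\, I\bigl[\, R;\, \overrightarrow{X}\,\bigr]

Here \(I[\overleftarrow{X}; R]\) measures model complexity and \(I[R; \overrightarrow{X}]\) measures predictive power; relaxing the complexity constraint (large \(\beta\)) is the limit in which, per the abstract, the causal state partition is recovered.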