CLARK GLYMOUR, PETER SPIRTES AND RICHARD SCHEINES
IN PLACE OF REGRESSION
ABSTRACT. Assuming an adaptation of Suppes's analysis of causality, we show
that multiple regression methods are fundamentally incorrect procedures for identifying
causes. This is because when regressors are correlated the existence of an unmeasured
common cause of regressor Xi and outcome variable Y may bias estimates of the
influence of other regressors Xk; variables having no influence on Y whatsoever may
thereby be given significant regression coefficients. The bias may be quite large.
Simulation studies show that standard regression model specification procedures make
the same error. The strategy of regressing on a larger set of variables and checking
stability may compound rather than remedy the problem. A similar difficulty in the
estimation of the influence of other regressors arises if some Xi is an effect rather than a
cause of Y. The problem appears endemic in uses of multiple regression on uncontrolled
variables, and unless somehow corrected appears to invalidate many scientific uses of
regression methods. We describe an implementation in the TETRAD II program of a
model specification algorithm that avoids these and certain other errors in large samples.
We illustrate the TETRAD II algorithm by applying it to a number of real and simulated
data sets.
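The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not one of the authors' simulation studies; the causal structure, the unit coefficients, the sample size, and the names `simulate`, `u`, `x1`, `x2` are all assumptions chosen for illustration. An unmeasured variable U is a common cause of regressor X1 and outcome Y, X2 is a cause of X1 (so the regressors are correlated), and X2 has no influence on Y whatsoever. Under these assumed coefficients the population regression of Y on X1 and X2 has coefficients 1.5 and -0.5, so ordinary least squares should overstate the influence of X1 and assign a clearly nonzero coefficient to X2.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Regress Y on X1 and X2 when an unmeasured U causes both X1 and Y.

    Assumed (hypothetical) linear structure, all noise terms standard normal:
        X2 -> X1 <- U -> Y,  X1 -> Y
    X2 has no direct or indirect influence on Y except through X1.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)            # unmeasured common cause
    x2 = rng.standard_normal(n)           # no influence on Y except via X1
    x1 = x2 + u + rng.standard_normal(n)  # X1 depends on X2 and U
    y = x1 + u + rng.standard_normal(n)   # true influence of X1 on Y is 1.0

    # Ordinary least squares of Y on an intercept, X1 and X2 (U is omitted).
    design = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


coef = simulate()
print("intercept:", round(coef[0], 3))
print("X1 coefficient (true direct influence 1.0):", round(coef[1], 3))
print("X2 coefficient (true direct influence 0.0):", round(coef[2], 3))
```

Increasing `n` does not shrink the X2 coefficient toward zero in this setup; the estimate converges to the biased population value, which is why the error is not one of sampling variability.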
The social sciences often use non-experimental or quasi-experimental
data, and multiple regression is the principal tool for causal inference in
such settings. Multiple regression, whether linear or non-linear, is the
preeminent statistical device through which hypotheses are confirmed,
conjectures formed, and policies suggested or justified. Whether you
read about education, human fertility and population growth, the epi-
demiology of pollutants, or almost any other topic of urgent human
concern, you will find data analyzed by regression methods to identify
which variables influence an outcome variable of interest, and to estimate
the strength of those influences. But multiple regression, whether
linear or non-linear, and the variety of search procedures that have
developed around it, are fundamentally incorrect methods for the first
purpose, and the results are demonstrably unreliable. The deepest prob-
lem with regression is that it mistakes the connection between causation
and probability; that error cannot be corrected by increased sample
sizes, or by testing for linearity or autocorrelation, or by transforming
P. Humphreys (ed.), Patrick Suppes: Scientific Philosopher, Vol. 1, 339-365.
© 1994 Kluwer Academic Publishers.