Data Management for Causal Algorithmic Fairness

Babak Salimi, Bill Howe, Dan Suciu
University of Washington
{bsalimi,suciu}@cs.washington.edu, billhowe@uw.edu

Abstract

Fairness is increasingly recognized as a critical component of machine learning systems. However, it is often the underlying data on which these systems are trained that reflects discrimination, suggesting a data management problem. In this paper, we first distinguish between the associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing work and identify future opportunities for applying data management techniques to causal algorithmic fairness.

1 Introduction

Fairness is increasingly recognized as a critical component of machine learning (ML) systems. These systems are now routinely used to make decisions that affect people's lives [11], with the aim of reducing costs, reducing errors, and improving objectivity. However, there is enormous potential for harm: the data on which we train algorithms reflects societal inequities and historical biases, and, as a consequence, models trained on such data can reinforce and legitimize discrimination and opacity. The goal of research on algorithmic fairness is to remove bias from machine learning algorithms. We recently argued that the algorithmic fairness problem is fundamentally a data management problem [43]: the selection of data sources, the transformations applied during pre-processing, and the assumptions made during training are all susceptible to biases that can exacerbate unfairness. The goal of this paper is to discuss the application of data management techniques to algorithmic fairness.
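To make concrete why associational measures alone can mislead, the following minimal sketch (with hypothetical admissions numbers, in the spirit of the well-known Berkeley admissions data) shows an instance of Simpson's paradox: within each department both groups are admitted at identical rates, yet the aggregate admission rates differ sharply, so a purely associational measure such as statistical parity would flag a disparity that disappears once the department, a causally relevant variable, is taken into account.

```python
# Hypothetical (group, department, applicants, admitted) counts.
applications = [
    ("A", "X", 80, 48),  # group A, dept X: 60% admitted
    ("A", "Y", 20, 4),   # group A, dept Y: 20% admitted
    ("B", "X", 20, 12),  # group B, dept X: 60% admitted
    ("B", "Y", 80, 16),  # group B, dept Y: 20% admitted
]

def rate(rows):
    """Admission rate over a subset of the rows."""
    applicants = sum(n for _, _, n, _ in rows)
    admitted = sum(k for _, _, _, k in rows)
    return admitted / applicants

# Per-department rates: identical for both groups (60% in X, 20% in Y).
for dept in ("X", "Y"):
    ra = rate([r for r in applications if r[0] == "A" and r[1] == dept])
    rb = rate([r for r in applications if r[0] == "B" and r[1] == dept])
    print(f"dept {dept}: group A {ra:.0%}, group B {rb:.0%}")

# Aggregate rates: 52% vs 28%, because group B applies mostly to the
# more selective department Y. Statistical parity, an associational
# measure, would report this as discrimination.
ra = rate([r for r in applications if r[0] == "A"])
rb = rate([r for r in applications if r[0] == "B"])
print(f"overall: group A {ra:.0%}, group B {rb:.0%}")
```

The numbers here are illustrative only; the point is that conditioning on (or intervening over) the right variables, which requires causal knowledge, can reverse the conclusion drawn from the marginal association between the protected attribute and the outcome.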
In Section 2 we distinguish between the associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning to capture natural situations, and that the popular associational definitions in ML can produce misleading results. In Section 3 we review existing work and identify future opportunities for applying data management techniques to ensure causally fair ML algorithms.

Copyright 2019 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

* This work is supported by the National Science Foundation under grants NSF III-1703281, NSF III-1614738, NSF AITF 1535565, and NSF award #1740996.

arXiv:1908.07924v3 [cs.DB] 1 Oct 2019

2 Fairness Definitions

Algorithmic fairness considers a set of variables V that includes a set of protected attributes S and a response variable Y, and a classification algorithm A : Dom(X) → Dom(O), where X ⊆ V, and the result is denoted