Improving Data Quality by Leveraging Statistical Relational Learning

LARYSA VISENGERIYEVA, Technische Universität Berlin, Germany
ALAN AKBIK, IBM Research - Almaden, USA
MANOHAR KAUL, IIT Hyderabad, India
TILMANN RABL, Technische Universität Berlin, Germany
VOLKER MARKL, Technische Universität Berlin, Germany

Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common approach to counteracting these issues is to formulate a set of data cleaning rules that identify and repair incorrect, duplicate, and missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints within a single routine, and to automate data curation. We propose an approach to data cleaning based on statistical relational learning (SRL). We argue that a particular formalism, Markov logic, is a natural fit for modeling data quality rules. Our approach allows for probabilistic joint inference over interleaved data cleaning rules to improve data quality. Furthermore, it eliminates the need to specify an order of rule execution. We describe how data quality rules expressed as formulas in first-order logic translate directly into the predictive model of our SRL framework.

1. INTRODUCTION

Having access to high-quality data is of great importance in data analysis. However, data in the real world is often considered dirty: it contains inaccurate, incomplete, inconsistent, duplicated, or stale values. A number of distinct data quality issues are known in the field of data quality management, such as data consistency, currency, accuracy, deduplication, and information completeness [Fan and Geerts 2012]. As previous work has observed, such data quality issues are detrimental to data analysis [Council 2013; Fan and Geerts 2012] and cause huge costs to businesses [Eckerson 2002].
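To make the translation from data quality rules to first-order formulas concrete, consider a functional dependency stating that a zip code determines a city. The following sketch is illustrative only; the predicate and attribute names are our own and do not reflect the notation used later in this paper:

```latex
% A functional dependency zip -> city as a first-order formula over tuples t1, t2:
\forall t_1, t_2:\;
  \mathit{zip}(t_1, z) \wedge \mathit{zip}(t_2, z) \wedge
  \mathit{city}(t_1, c_1) \wedge \mathit{city}(t_2, c_2)
  \;\Rightarrow\; c_1 = c_2
```

In Markov logic, such a formula can be kept as a hard constraint, or it can be attached to a finite weight $w$, turning it into a soft rule whose violations are penalized by $w$ rather than forbidden outright.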
Therefore, improving data quality with respect to business and integrity constraints is a crucial component of data management. A common approach to increasing data quality is to formulate a set of data cleaning rules that detect semantic errors by utilizing data dependencies [Fan and Geerts 2012; Arasu et al. 2009; Dallachiesa et al. 2013; Geerts et al. 2013]. However, previous research has identified a number of requirements, and accompanying challenges, associated with creating such rule sets (cf. Section 2):

Interleaved rules. First, while each such rule usually addresses one data quality issue individually, the rules as a whole typically interact [Fan and Geerts 2012; Fan et al. 2014]. For instance, a rule that deletes duplicates might perform better after missing data has been imputed, while a rule that imputes missing data might perform better if duplicates have already been removed. Therefore, we argue for modeling data quality rules such as deduplication and missing value imputation jointly, rather than as separate processes. Second, rules in such a rule set may need to be modeled as "soft" or "hard" in order to balance constraints of different importance [Yakout et al. 2013], especially within a set of interacting rules.

Author's email: L. Visengeriyeva, larysa.visengeriyeva@tu-berlin.de.
© 2016 ACM.
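The distinction between hard and soft rules can be sketched in a few lines of code. The following is our own simplified illustration, not the system described in this paper: a hard rule rules out any candidate repair that violates it (infinite cost), while a soft rule adds a finite, weighted penalty per violation, in the spirit of Markov logic. All function and attribute names here are invented for the example.

```python
import math

def fd_violations(tuples, lhs, rhs):
    """Count violations of the functional dependency lhs -> rhs."""
    seen = {}
    violations = 0
    for t in tuples:
        if t[lhs] in seen and seen[t[lhs]] != t[rhs]:
            violations += 1
        else:
            seen[t[lhs]] = t[rhs]
    return violations

def score(tuples, hard_rules, soft_rules):
    """Score a candidate repair: higher is better.

    hard_rules: list of (lhs, rhs) FDs that must hold (weight = infinity).
    soft_rules: list of ((lhs, rhs), weight) FDs that may be violated at a cost.
    """
    for lhs, rhs in hard_rules:
        if fd_violations(tuples, lhs, rhs) > 0:
            return -math.inf  # a hard constraint rules this repair out entirely
    cost = 0.0
    for (lhs, rhs), weight in soft_rules:
        cost += weight * fd_violations(tuples, lhs, rhs)
    return -cost

# Joint inference is approximated here by exhaustive search over a
# tiny, hand-built candidate set of repairs.
dirty = [{"zip": "10115", "city": "Berlin"},
         {"zip": "10115", "city": "Brelin"}]   # typo violates zip -> city
repair = [{"zip": "10115", "city": "Berlin"},
          {"zip": "10115", "city": "Berlin"}]

hard = [("zip", "city")]
best = max([dirty, repair], key=lambda t: score(t, hard, []))
```

In a real SRL system the candidate space is far too large for exhaustive search, which is precisely why probabilistic joint inference over the weighted rules is needed; this toy version only shows how hard and soft constraints enter a single objective.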
Paper 25, ICIQ 2016, Ciudad Real (Spain), June 22-23, 2016.