Editing Rules: Discovery and Application to Data Cleaning

Thierno Diallo #∗, Jean-Marc Petit #, and Sylvie Servigne #

# LIRIS UMR 5205 CNRS/Université de Lyon, INSA de Lyon, bâtiment B. Pascal 20, Avenue Albert Einstein - 69622 Villeurbanne cedex
∗ Orchestra Networks SA, 11 rue Scribe, 75009 Paris, France
firstname.lastname@insa-lyon.fr

Abstract. Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. A variety of integrity constraints, such as Conditional Functional Dependencies (CFDs), have been studied for data cleaning. Data repairing methods based on these constraints are effective at detecting inconsistencies but limited in how they correct data; worse, they can even introduce new errors. Based on Master Data Management principles, a new class of data quality rules known as Editing Rules (eRs) tells how to fix errors, pointing out which attributes are wrong and what values they should take. However, finding data quality rules is an expensive process that involves intensive manual effort. In this paper, we develop pattern mining techniques for discovering eRs from existing source relations (possibly dirty) with respect to master relations (assumed to be clean and accurate). In this setting, we propose a new semantics for eRs that takes advantage of both source and master data. The problem turns out to be strongly related to the discovery of both CFDs and one-to-one correspondences between source and target attributes. We propose efficient techniques to address these two subproblems. We have implemented and evaluated our techniques on real-life databases. Experiments show the feasibility, scalability, and robustness of our proposal.

1 Introduction

Poor data quality continues to be an important issue for companies. Erroneous, incomplete, or duplicate data lead to poor business decisions, which are costly.
There is an increased need for effective methods to improve data quality and to restore consistency. A variety of integrity constraints have been studied for data cleaning, from traditional functional and inclusion dependencies to their conditional extensions [6, 7, 14, 17]. These constraints help us determine whether errors are present in the data, but they fall short of telling us which attributes are concerned by