A Model-Based Approach for Developing Data Cleansing Solutions

MARIO MEZZANZANICA, ROBERTO BOSELLI, MIRKO CESARINI, and FABIO MERCORIO, Department of Statistics and Quantitative Methods, C.R.I.S.P. Research Centre, University of Milano-Bicocca, Italy

The data extracted from electronic archives is a valuable asset; however, its (often poor) quality should be addressed before performing data analysis and decision-making activities. Low-quality data is frequently cleansed using business rules derived from domain knowledge. Unfortunately, designing and implementing cleansing activities based on business rules requires considerable effort. In this article, we illustrate a model-based approach for identifying inconsistencies and performing corrective interventions, thus simplifying the development of cleansing activities. The article shows how the cleansing activities required to perform a sensitivity analysis can be easily developed using the proposed model-based approach. The sensitivity analysis provides insights into how the cleansing activities can affect the results of indicator computation. The approach has been successfully used on a database describing the working histories of the population of an Italian area. A model formalizing how data should evolve over time in this domain (i.e., a data consistency model) was created by means of formal methods and used to perform the cleansing and sensitivity analysis activities.

Categories and Subject Descriptors: H.2.m [Database Management]: Miscellaneous; D.2.4 [Software/Program Verification]: Formal methods; Model checking

General Terms: Algorithms, Verification, Theory

Additional Key Words and Phrases: Data quality, data consistency, data verification, ETL, data believability

ACM Reference Format:
Mario Mezzanzanica, Roberto Boselli, Mirko Cesarini, and Fabio Mercorio. 2015. A model-based approach for developing data cleansing solutions. ACM J. Data Inf. Qual. 5, 4, Article 13 (February 2015), 28 pages.
DOI: http://dx.doi.org/10.1145/2641575

1. INTRODUCTION

In the last decade, the use of information systems has grown apace, fostered by the availability of several Information and Communication Technology (ICT)-based services. As a result, large amounts of data have been collected from business, social, and governmental transactions, and these data can describe in depth the ongoing relations among people, public institutions, and organizations. Such data could be used to accurately analyse social, economic, and business phenomena, and to support decision-making activities such as the evaluation of active policies, the allocation of resources, and the design and improvement of (public) services.

A preliminary version of this work appears in M. Mezzanzanica, R. Boselli, M. Cesarini, and F. Mercorio, “Data Quality through Model Checking Techniques,” in Proceedings of Intelligent Data Analysis (IDA), Lecture Notes in Computer Science, Vol. 7014, 2011, pp. 270–281.

Authors’ addresses: M. Mezzanzanica, R. Boselli, M. Cesarini, and F. Mercorio, University of Milano-Bicocca, Department of Statistics and Quantitative Methods, Via Bicocca degli Arcimboldi 8, II floor, I-20126 Milano, Italy; emails: {mario.mezzanzanica, roberto.boselli, mirko.cesarini, fabio.mercorio}@unimib.it.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee.
Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2015 ACM 1936-1955/2015/02-ART13 $15.00
DOI: http://dx.doi.org/10.1145/2641575

ACM Journal of Data and Information Quality, Vol. 5, No. 4, Article 13, Publication date: February 2015.