A FORMAL DEFINITION OF DATA QUALITY PROBLEMS
(Completed paper)

Paulo Oliveira
DI/gEPL – Languages Specification and Processing Group, University of Minho (Portugal), and
GECAD/ISEP-IPP – Knowledge Engineering and Decision Support Group, Institute of Engineering – Polytechnic of Porto (Portugal)
pjo@isep.ipp.pt

Fátima Rodrigues
GECAD/ISEP-IPP – Knowledge Engineering and Decision Support Group, Institute of Engineering – Polytechnic of Porto (Portugal)
mfc@isep.ipp.pt

Pedro Henriques
DI/gEPL – Languages Specification and Processing Group, University of Minho (Portugal)
prh@di.uminho.pt

Abstract: The exploration of data to extract information or knowledge to support decision making is a critical success factor for an organization in today’s society. However, several problems can affect data quality. These problems have a negative effect on the results extracted from data, affecting their usefulness and correctness. In this context, it is important to know and understand these data problems. This paper presents a taxonomy of data quality problems, organizing them by the granularity level at which they occur. A formal definition is presented for each problem included. The taxonomy provides rigorous definitions, which are more informative than the textual definitions used in previous works. These definitions are useful for the development of a data quality tool that automatically detects the identified problems.

Key Words: Data Quality Problems, Formal Definition, Taxonomy

1. INTRODUCTION

Nowadays, public and private organizations understand the value of data. Data is a key asset for improving efficiency in today’s dynamic and competitive business environment. However, as organizations begin to create integrated data warehouses for decision support, the resulting Data Quality (DQ) problems become painfully clear [12]. A study by the Meta Group revealed that 41% of data warehouse projects fail, mainly due to insufficient DQ, leading to wrong decisions [8].
The quality of the input data strongly influences the quality of the results [15] (the “garbage in, garbage out” principle). The concept of DQ is vast, comprising different definitions and interpretations. DQ is studied mainly in two research communities: databases and management. The first studies DQ from a technical point of view (e.g., [4]), while the second is also concerned with other aspects or dimensions of DQ, such as accessibility, believability, relevancy, interpretability, and objectivity (e.g., [13, 17]). In this paper we follow the databases perspective, i.e., DQ means only the quality of the data values or instances.