Journal of Integrative Bioinformatics 2006 http://journal.imbio.de/

Data Cleaning and Semantic Improvement in Biological Databases

Daniele Apiletti, Giulia Bruno, Elisa Ficarra, Elena Baralis

Dep. of Control and Computer Engineering (DAUIN), Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Torino, Italy

Summary

Public genomic and proteomic databases can be affected by a variety of errors. These errors may involve either the description or the meaning of data (namely, syntactic or semantic errors). We focus our analysis on the detection of semantic errors, in order to verify the accuracy of the stored information. In particular, we address the issue of data constraints and functional dependencies among attributes in a given relational database. Constraints and dependencies express the semantics that link attributes in a database schema, and knowledge of them may be exploited to improve data quality and integration in database design, and to perform query optimization and dimensional reduction. We propose a method to discover data constraints and functional dependencies by means of association rule mining. Association rules are extracted among attribute values and allow us to find causality relationships among them. Then, by analyzing the support and confidence of each rule, (probabilistic) data constraints and functional dependencies may be detected. With our method we can both reveal the presence of erroneous data and highlight novel semantic information. Moreover, our method is database-independent because it infers rules from data. In this paper, we report the application of our techniques to the SCOP (Structural Classification of Proteins) and CATH Protein Structure Classification databases.

1 Introduction

For about thirty years, biological data have been generated by a variety of biomedical devices and stored at an increasing rate in public repositories.
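As a rough illustration of the approach outlined in the Summary, the following minimal sketch computes the support and confidence of value-level rules between two attributes and flags the tuples that contradict a high-confidence rule. It is not the authors' implementation: the function names, the `min_conf` threshold, the attribute names `fold` and `class`, and the toy values are all assumptions made for this example, and the data are invented rather than taken from SCOP or CATH.

```python
from collections import Counter


def mine_rules(rows, lhs, rhs):
    """Return {(a, b): (support, confidence)} for every rule lhs=a -> rhs=b."""
    n = len(rows)
    pair_counts = Counter((r[lhs], r[rhs]) for r in rows)
    lhs_counts = Counter(r[lhs] for r in rows)
    return {
        (a, b): (count / n, count / lhs_counts[a])
        for (a, b), count in pair_counts.items()
    }


def dependency_strength(rules):
    """Fraction of rows consistent with lhs -> rhs when each lhs value
    is mapped to its most frequent rhs value (1.0 means an exact
    functional dependency in the data)."""
    best = {}
    for (a, _b), (support, _conf) in rules.items():
        best[a] = max(best.get(a, 0.0), support)
    return sum(best.values())


def suspicious_rows(rows, lhs, rhs, rules, min_conf=0.8):
    """Rows that contradict a rule whose confidence is at least min_conf.
    The threshold value is an arbitrary assumption for this sketch."""
    strong = {a: b for (a, b), (_sup, conf) in rules.items() if conf >= min_conf}
    return [r for r in rows if r[lhs] in strong and r[rhs] != strong[r[lhs]]]


# Toy table (hypothetical values): one row breaks the pattern f1 -> c1.
rows = (
    [{"fold": "f1", "class": "c1"}] * 9
    + [{"fold": "f1", "class": "c2"}]
    + [{"fold": "f2", "class": "c2"}] * 10
)

rules = mine_rules(rows, "fold", "class")
for (a, b), (support, conf) in sorted(rules.items()):
    print(f"fold={a} -> class={b}: support={support:.2f}, confidence={conf:.2f}")

print("dependency strength:", round(dependency_strength(rules), 2))
print("suspicious rows:", suspicious_rows(rows, "fold", "class", rules))
```

In this toy table the dependency from `fold` to `class` holds for most rows but not all, so it would count as a probabilistic dependency, and the single contradicting row is the kind of tuple a curator might inspect as either an error or a genuine exception.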
Recently, a big effort has been made to integrate distributed heterogeneous databases, where researchers continuously store their new experimental results. However, data quality improvement is one of the foremost tasks to perform, since the accuracy of data analysis and the ability to produce correct results from data mining rely on it. Public biological databases can be affected by a variety of errors, which may involve either the description or the meaning of information. The existence of such erroneous or poor data adversely affects any further processing or application.

The data quality problems typically addressed can be divided into two categories: syntactic anomalies and semantic anomalies. Among syntactic anomalies there are the problems of incompleteness (due to the lack of attribute values), inaccuracy (due to the presence of errors and outliers), lexical errors, domain format errors and irregularity. Among semantic anomalies there are discrepancy (due to a conflict between some attribute values, e.g. age and date of birth), ambiguity (due to the presence of synonyms, homonyms or abbreviations), redundancy (due to the presence of duplicate information), inconsistency (due to an integrity constraint violation, e.g. the attribute age must be a value greater than 0, or a functional constraint violation, e.g. if the attribute married is false, the attribute wife must be null),

Copyright 2006 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).