Techniques for Effective Integration, Maintenance and Evolution of Species Databases

Andrew C. Jones, Iain Sutherland, Suzanne M. Embury and W. Alex Gray
Department of Computer Science
Cardiff University
PO BOX 916
Cardiff, CF24 3XF, UK
{Andrew.C.Jones, I.Sutherland}@cs.cf.ac.uk
{S.M.Embury, W.A.Gray}@cs.cf.ac.uk

Richard J. White and John S. Robinson
Biodiversity & Ecology Research Division
School of Biological Sciences
University of Southampton
Southampton, SO16 7PX, UK
{R.J.White, J.S.Robinson}@soton.ac.uk

Frank A. Bisby and Sue M. Brandt
Biodiversity Informatics Laboratory
Centre for Plant Diversity & Systematics
The University of Reading
Reading RG6 6AS, UK
{F.A.Bisby, S.M.Brandt}@reading.ac.uk

Abstract

The LITCHI project is concerned with the integration and maintenance of databases of biological knowledge organised by species. We use constraints pertaining to good taxonomic practice in order to identify taxonomic conflicts in individual species databases and in databases formed by merging species databases from distinct sources. The LITCHI system can be used to resolve such conflicts incrementally. As the project has progressed, we have identified a number of distinctive features of the problem domain, and needs of the intended users, which have had a significant impact on the techniques and modes of operation that we found to be appropriate, especially in contrast with applications that handle rapidly-accumulating ‘raw’ data. It is upon these aspects of LITCHI that we concentrate in the present paper, viewing LITCHI as an example of the more general problem of merging scientific data sets in which conflicts between the terminology used can occur.

1. Introduction

Taxonomy may be defined as ‘the study and description of the variation of organisms, the investigation of the causes and consequences of this variation and the manipulation of the data obtained to produce a system of classification’ [16]. In this paper we present the LITCHI¹ system.
LITCHI is a tool we have developed with the aim of helping taxonomists to test checklists of scientific names for conflicts and hence (i) to improve the data quality in the taxonomic databases from which the checklists were obtained, and (ii) to provide a basis for integration of taxonomic databases. We provide a survey of the techniques we have had to employ and develop in order to build a tool that achieves these aims.

Taxonomy is just one of many scientific areas in which consistent nomenclature is important: other examples include planetary nomenclature [8], geographical nomenclature (e.g. [14]) and gene nomenclature (e.g. the SGD Gene Name Registry²). In all these domains it is important to be able to refer to the entities of interest unambiguously. In our system we exploit the fact that it has been possible to develop constraints that detect cases in which conflicts occur, by inspection of the scientific names used and the relationships between them. In LITCHI we are not concerned with interpretation of vast quantities of fast-accumulating data using, for example, OLAP [5] techniques; rather, we are providing a tool for maintaining a consistent classification scheme in the face of conflicting and changing expert opin-

¹ Logic-based Integration of Taxonomic Conflicts in Heterogeneous Information systems
² http://genome-www.stanford.edu/Saccharomyces/gene_guidelines.html

0-7695-0686-0/00 $10.00 © 2000 IEEE