Computers & Geosciences 32 (2006) 1368–1377 The problem of missing data in geoscience databases Stephen Henley à Resources Computing International Ltd., 185 Starkholmes Road, Matlock, Derbyshire DE4 5JA, UK Received 14 September 2005; received in revised form 5 December 2005; accepted 16 December 2005 Abstract SQL is the (more or less) standardised language that is used by the majority of commercial database management systems. However, it is seriously flawed, as has been documented in detail by Date, Darwen, Pascal, and others. One of the most serious problems with SQL is the way it handles missing data. It uses a special value ‘NULL’ to represent data items whose value is not known. This can have a variety of meanings in different circumstances (such as ‘inapplicable’ or ‘unknown’). The SQL language also allows an ‘unknown’ truth value in logical expressions. The resulting incomplete three-valued logic leads to inconsistencies in data handling within relational database management systems. Relational database theorists advocate that a strict two-valued logic (true/false) be used instead, with prohibition of the use of NULL, and justify this stance by assertion that it is a true representation of the ‘real world’. Nevertheless, in real geoscience data there is a complete gradation between exact values and missing data: for example, geochemical analyses are inexact (and the uncertainty should be recorded); the precision of numeric or textual data may also be expressed qualitatively by terms such as ‘approximately’ or ‘possibly’. Furthermore, some data are by their nature incomplete: for example, where samples could not be collected or measurements could not be taken because of inaccessibility. It is proposed in this paper that the best way to handle such data sets is to replace the closed-world assumption and its concomitant strict two-valued logic, upon which the present relational database model is based, by the open-world assumption which allows for other logical values in addition to the extremes of ‘true’ and ‘false’. Possible frameworks for such a system are explored, and could use Codd’s ‘marks’, Darwen’s approach (recording the status of information known about each data item), or other approaches such as fuzzy logic. r 2006 Elsevier Ltd. All rights reserved. Keywords: Relational database; Open-world assumption; Closed-world assumption; Missing data; SQL; Logic; Fuzzy logic 1. Introduction Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know—Donald Rumsfeld, US Secretary of Defense In the geosciences (as in other observational sciences, and as in military intelligence), it is very common for observational data sets to be incom- plete. Data items may be missing altogether, or they may be imprecise in one way or another. There are many things we do not know—and many more that we do not know we do not know. ARTICLE IN PRESS www.elsevier.com/locate/cageo 0098-3004/$ - see front matter r 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2005.12.008 à Tel.: +44 1629 581454. E-mail address: stephen.henley@resourcescomputing.com.