Knowledge Discovery and Management Laboratory
Flinders Institute for Research in Science and Technology
Flinders University of South Australia

Technical Report KDM-01-001
Originally released October 2001, Revised November 2002

A Unifying Semantic Distance Measure for Determining the Similarity of Attribute Values

John F. Roddick¹, Kathleen Hornsby² and Denise de Vries¹

¹ School of Informatics and Engineering, PO Box 2100, Adelaide 5001, South Australia. Email: {roddick,denise.devries}@infoeng.flinders.edu.au
² National Centre for Geographic Information and Analysis, University of Maine, Orono, Maine 04469-5711, USA. Email: khornsby@spatial.maine.edu

Abstract

The relative difference between two data values is of interest in a number of application domains including temporal and spatial applications, schema versioning, data warehousing (particularly data preparation), internet searching, validation and error correction, and data mining. Moreover, consistency across systems in determining such distances and the robustness of such calculations is essential in some domains and useful in many. Despite this, there is no generally adopted approach to determining such distances and no accommodation of distance within SQL or any commercially available DBMS.

For non-numeric data values calculating the difference between values often requires application-specific support but even for numeric values the practical distance between two values may not simply be their numeric difference or Euclidean distance.

In this paper, a model of semantic distance is developed in which a graph-based approach is used to quantify the distance between two data values. The approach facilitates a notion of distance, both as a simple traversal distance and as weighted arcs. Transition costs, as an additional expense of passing through a node, are also accommodated.
Furthermore, multiple distance measures can be incorporated and a method of ‘localisation’ is discussed which allows relevant information to take precedence over less relevant information. Some results from our investigations, including our SQL based implementation, are presented.

Keywords: Semantic distance, difference measures, similarity.

This paper will appear in the Proceedings of the 26th Australasian Computer Science Conference, Adelaide, Australia, February 2003. Michael Oudshoorn, Ed. Conferences in Research and Practice in Information Technology, Volume 16. ACS.

Kathleen Hornsby’s work is partially supported by a grant from the National Imagery and Mapping Agency, NMA201-00-1-2009.

1 Introduction

In most applications, determining the relative distance between two objects through an inspection of the values of selected attributes is an important function. For simple numeric domains, this does not often cause a significant problem. However, for non-numeric or non-planar numeric domains, even those that are enumerated, this requires application-specific support. Despite this, there is no generally adopted approach to determining semantic distance and there is currently no accommodation of distance within SQL or any commercially available DBMS. We use the term semantic distance to refer to the notion of relative or useful (as opposed to lexicographical, linguistic or physical) distance between concepts.
The kinds of application requiring such support vary widely and include:

• temporal and spatial applications, in which the quickest route may not be the shortest or cheapest and vice versa,

• schema versioning, in which data stored under one protocol must be comparable with new data stored under a later protocol,

• data warehousing, in which the summarisation and cleaning of data may be achieved more efficiently through the clustering of objects with similar values,

• search engines, in which the entered keywords might only be indicative of the useful keywords to use when searching; for example, a user whose query contains the keywords Venezuela and Duck might also be interested in articles which mention the Orinoco Goose,

• validation and error correction, where the closeness of an attribute’s value to a predefined set of values may require checking and/or correction, and

• data mining, in which the proximity of objects or the extraction of rules about clustered objects may be required.
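
To make the first item in the list concrete, the following minimal Python sketch illustrates the kind of graph-based distance summarised in the abstract: arcs carry weights, and a transition cost is charged for passing through an intermediate node. It is an illustration under stated assumptions, not the implementation reported in this paper. The function name `semantic_distance`, the node labels, the weights, the treatment of arcs as undirected, and the use of Dijkstra's algorithm to find the cheapest path are all our assumptions for the purpose of the example.

```python
import heapq


def semantic_distance(arcs, source, target, transition_cost=None):
    """Cheapest path cost between two values in a weighted graph.

    arcs            -- dict mapping (u, v) pairs to non-negative arc weights
                       (assumed undirected for this sketch)
    transition_cost -- optional dict mapping a node to the extra cost of
                       passing through it (hypothetical parameter name)
    """
    transition_cost = transition_cost or {}

    # Build an adjacency list; each arc is usable in both directions.
    graph = {}
    for (u, v), weight in arcs.items():
        graph.setdefault(u, []).append((v, weight))
        graph.setdefault(v, []).append((u, weight))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        # The transition cost applies only when a path continues through a
        # node, so it is not charged at the source or at the target.
        through = 0.0 if node == source else transition_cost.get(node, 0.0)
        for neighbour, weight in graph.get(node, []):
            candidate = dist + through + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")  # no path between the two values


# Invented example: the two-hop route via B is cheaper than the direct arc
# until passing through B becomes expensive.
arcs = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 3.0}
print(semantic_distance(arcs, "A", "C"))              # 2.0
print(semantic_distance(arcs, "A", "C", {"B": 5.0}))  # 3.0
```

Setting every arc weight to 1 and omitting the transition costs reduces this to a simple traversal distance, that is, the number of arcs on the shortest path.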