R. Meersman, T. Dillon, P. Herrero (Eds.): OTM 2009, Part II, LNCS 5871, pp. 1183–1200, 2009.
© Springer-Verlag Berlin Heidelberg 2009
XML-SIM: Structure and Content Semantic Similarity Detection Using Keys
Waraporn Viyanon and Sanjay K. Madria
Department of Computer Science, Missouri University of Science and Technology
Rolla, Missouri, USA
wvz7b@mst.edu, madrias@mst.edu
Abstract. This paper describes an approach for detecting structure and content
semantic similarity between two XML documents from heterogeneous data sources
using the notion of keys. Comparisons with the previous systems (XDoI and
XDI-CSSK) show that the new approach performs better by an order of magnitude
in terms of detection, false positives, and execution time.
Keywords: XML Similarity Detection, keys, clustering, matching.
1 Introduction
XML has become increasingly relevant as a means of exchanging information and
representing complex data on the Internet [7]. Different XML sources may have
similar contents that are described using different tag names and structures;
bibliographic data in DBLP [20] and Sigmod Record [1] is one example. Integrating
similar XML documents from different data sources gives users access to more
complete and useful information.
Since XML documents encode not only structure but also data, measuring their
similarity accurately for XML document integration requires similarity computation
on both structure and content. Most matching algorithms treat XML documents as
collections of items represented as XML trees. The trees are then fragmented into
small independent items representing objects, called subtrees. Next, to find out
which subtrees are similar between two XML documents, subtree similarities are
measured in terms of both structure and content. Subtree pairs whose similarity
exceeds a given threshold are considered matched pairs, which are finally
integrated into one XML document. Recent work on XML document integration includes
SLAX [12] and [3, 4, 6]. Our earlier XML integration techniques, XDoI [18] and
XDI-CSSK [17], outperform SLAX [12]. However, these two approaches first compute
XML content similarity degrees without taking structure into account and only
compare structure similarity afterwards. It should rather be the other way around:
content similarity between two given subtrees should be compared based on their
similar structures. The content similarity degrees are measured by computing
common leaf-node values from a subtree in a base XML document against another
subtree in the target XML document. This is a