R. Meersman, T. Dillon, P. Herrero (Eds.): OTM 2009, Part II, LNCS 5871, pp. 1183–1200, 2009.
© Springer-Verlag Berlin Heidelberg 2009

XML-SIM: Structure and Content Semantic Similarity Detection Using Keys

Waraporn Viyanon and Sanjay K. Madria
Department of Computer Science, Missouri University of Science and Technology, Rolla, Missouri, USA
wvz7b@mst.edu, madrias@mst.edu

Abstract. This paper describes an approach for detecting structure and content semantic similarity between two XML documents from heterogeneous data sources using the notion of keys. Comparisons with the previous systems (XDoI and XDI-CSSK) are presented to show that our new approach performs better by an order of magnitude in terms of detection, false positives and execution time.

Keywords: XML Similarity Detection, keys, clustering, matching.

1 Introduction

XML has become increasingly relevant as a means of exchanging information and representing complex data on the Internet [7]. Different XML sources may have similar contents but describe them using different tag names and structures; bibliographic data in DBLP [20] and Sigmod Record [1] are an example. Integrating similar XML documents from different data sources gives users access to more complete and useful information.

Since XML documents encode not only structure but also data, measuring their similarity accurately for XML document integration requires similarity computation on both structure and content. In most matching algorithms, XML documents are considered as collections of items represented as XML trees. The trees are then fragmented into small independent items, called subtrees, each representing an object. Next, to find out which subtrees are similar between two XML documents, their subtree similarities are measured in terms of both structure and content.
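As an illustration only, the generic fragment-and-compare scheme outlined above can be sketched in Python. This is not the XML-SIM algorithm or any of the cited systems: the function names (`leaf_values`, `content_similarity`, `match_subtrees`), the overlap measure, the 0.5 threshold and the two toy documents are all assumptions made for this sketch, and the structural comparison is left out entirely.

```python
import xml.etree.ElementTree as ET


def leaf_values(subtree):
    """Collect the normalized text of every leaf node under a subtree."""
    return {
        node.text.strip().lower()
        for node in subtree.iter()
        if len(node) == 0 and node.text and node.text.strip()
    }


def content_similarity(base_subtree, target_subtree):
    """Fraction of base leaf values that also occur in the target subtree.

    Hypothetical measure chosen for illustration; tag names are ignored,
    so differently named elements can still match on their values.
    """
    base = leaf_values(base_subtree)
    target = leaf_values(target_subtree)
    if not base:
        return 0.0
    return len(base & target) / len(base)


def match_subtrees(base_root, target_root, threshold=0.5):
    """Pair every base subtree with every target subtree above the threshold.

    Each direct child of a root is treated as one subtree (an assumption).
    Returns (base_index, target_index, score) tuples.
    """
    matches = []
    for i, base_subtree in enumerate(base_root):
        for j, target_subtree in enumerate(target_root):
            score = content_similarity(base_subtree, target_subtree)
            if score > threshold:
                matches.append((i, j, score))
    return matches


# Toy documents (made up): same publication, different tag names.
BASE_XML = """
<dblp>
  <article><author>A. Smith</author><title>XML Keys</title><year>2009</year></article>
  <article><author>B. Jones</author><title>Tree Matching</title><year>2008</year></article>
</dblp>
"""
TARGET_XML = """
<SigmodRecord>
  <paper><writer>A. Smith</writer><name>XML Keys</name></paper>
</SigmodRecord>
"""

base_root = ET.fromstring(BASE_XML)
target_root = ET.fromstring(TARGET_XML)
matches = match_subtrees(base_root, target_root)
print(matches)
```

In this toy run the first base article pairs with the only target paper, because two of its three leaf values recur there, while the second article shares no values and is dropped. Comparing every base subtree with every target subtree is quadratic in the number of subtrees, which is the kind of cost a real system would want to reduce.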
The subtree pairs whose similarity exceeds a given threshold are considered matched pairs, which are finally integrated into one XML document. Recent works on XML document integration include SLAX [12] and [3, 4, 6].

Our earlier XML integration techniques, XDoI [18] and XDI-CSSK [17], outperform SLAX [12]. However, these two approaches first compute XML content similarity degrees without taking structure into account and compare structure similarity only afterwards. It should rather be the other way around: content similarity between two given subtrees should be compared on the basis of their similar structures. The content similarity degrees are measured by computing common leaf-node values from a subtree in a base XML document against another subtree in the target XML document. This is a