Measuring XML document similarity: a case study for evaluating Information Extraction Systems

Gerardo Canfora, Luigi Cerulo, Rita Scognamiglio
RCOST - Research Centre on Software Technology
Department of Engineering - University of Sannio
Viale Traiano - 82100 Benevento, Italy
{canfora, lcerulo, ritasco}@unisannio.it

Abstract

Measuring similarity between trees, such as XML structured information, plays an important role in many applications, and in particular in the evaluation of the effectiveness of Information Extraction Systems (IES). In this paper we present an experience in evaluating IES in terms of extraction and adaptation effectiveness. In the first part of the paper, a similarity measure between XML trees based on a common subtree detection algorithm is introduced; then, a case study aimed at evaluating the effectiveness of a group of IES is presented as an example of application.

1. Introduction

Many applications require measuring the similarity, or the distance, between objects. When the objects are XML documents, the problem of measuring their similarity becomes the problem of computing the similarity of trees, which is a particular case of graph matching. Early approaches to graph matching were restricted to finding graph or subgraph isomorphisms between two graphs [4]. Subgraph isomorphism is useful to check whether two objects are the same, or whether one object is present in a group of several objects. An index commonly used to compute similarity is the maximum common subgraph of two graphs [15]. The maximum common subgraph of two graphs is a subgraph of both graphs that has, among all such subgraphs, the maximum number of nodes. The more similar the two graphs are, the larger their maximum common subgraph is. A powerful alternative to the maximum common subgraph is graph edit distance, which extends the well-known concept of string edit distance to the domain of graphs [14].
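As a point of reference for the string edit distance mentioned above, the following is a minimal Python sketch of the classic dynamic-programming formulation with unit costs. The function name `string_edit_distance` is an illustrative assumption; the sketch covers strings only and is not the similarity measure proposed in this paper.

```python
def string_edit_distance(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # previous[j] is the distance between the source prefix processed
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into "" takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For example, `string_edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion). Only two rows of the table are kept at a time, so memory use grows with the length of `target` rather than with the product of both lengths.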
Graph edit distance considers a set of graph edit operations, such as deletion, insertion, and substitution (i.e., label change). Edit operations can be applied to nodes as well as to edges. The edit distance of two graphs is defined as the length of the shortest sequence of edit operations that transforms one graph into the other. The shorter this sequence is, the more similar the two graphs are. In practical applications, some edit operations may be more important than others; hence, a cost is assigned to each individual edit operation. Recent developments in graph matching have shown that there is a direct relationship between graph edit distance and maximum common subgraph, in the sense that they are equivalent to each other under certain cost functions [3].

The computation of XML document similarity has been addressed in various application contexts, and many approaches have been developed and used. Approaches based on plain text have been introduced in [12]. Heuristic approaches have been introduced by Chawathe et al. in [6]. XML Diff is a tool that detects structural changes, such as the movement of an XML subtree, and produces a Diffgram expressed in a formal language that describes the differences between the two XML documents [2]. A survey of edit distance approaches for XML trees, suitable for various application contexts, is presented in [17]. It is worth noting that every application context has its own features and needs, so an appropriate measure that highlights specific similarity characteristics must be defined.

In this paper we introduce an XML document similarity measure for the purpose of measuring the effectiveness of an IES. Information extraction consists of deriving interesting information from unstructured or free text documents (e.g., newspaper articles, web pages, etc.) and organizing it in a meaningful way (e.g., frame slots, database tuples, hierarchical structures) [7].
It has an important role in many application areas, including the semantic web, for annotating HTML documents, and information retrieval, for improving document indexing. The extraction rules, usually named wrapper rules, are defined in a formal, semiformal, or natural language, and, depending on the wrapper model subsuming the IES, the rule generation can be manual [13],

Proceedings of the 10th International Symposium on Software Metrics (METRICS’04) 1530-1435/04 $ 20.00 IEEE