Information Extraction from Tree Documents by Learning Subtree Delimiters

Boris Chidlovskii
Xerox Research Centre Europe, France
6, chemin de Maupertuis, F-38240 Meylan
chidlovskii@xrce.xerox.com

Abstract

Information extraction from HTML pages has conventionally treated them as plain text documents extended with HTML tags. However, the growing maturity and correct usage of the HTML/XHTML formats open an opportunity to treat Web pages as trees, to mine the rich structural context in the trees, and to learn accurate extraction rules. In this paper, we generalize the notion of delimiter, developed for string information extraction, to tree documents. Similar to delimiters in strings, we define delimiters in tree documents as subtrees surrounding the text leaves. We formalize wrapper induction for tree documents as learning classification rules based on subtree delimiters. We analyze a restricted case of subtree delimiters in the form of simple paths. We design an efficient data structure for storing candidate delimiters and an incremental algorithm for finding the most discriminative subtree delimiters for the wrapper.

1 Introduction

The immensity of Web data valuable for various human needs has led to research on information extraction from the Web, with wrapper learning from annotated samples being one of the major research trends. Since the first wrappers [10], crafted for a specific structure of Web pages, wrapper classes have grown in their expressive power and their capacity to accommodate structural variations. While further empowering the wrapper learning methods and their combinations remains crucial for developing flexible IE systems, another important goal arises in the controlled reduction of sample annotation.
Learning from both labeled and unlabeled samples appears, in the case of wrapper learning, as learning from partially annotated Web pages, where the annotation of items in a page is integrated with the learning in an interactive system and driven by the learning bias and accuracy requirements.

Over the last 10 years, the HTML format has seen several evolutionary changes and has achieved a maturity level with a wider use of XHTML/XML for publishing Web content. In November 2002, we analyzed HTML pages from 32 sites we have been tracking since 1998 (360 to 420 pages per year). The analysis discovered a tendency toward cleaner pages and richer tag context around content elements. First, the nesting error ratio, expressed as the percentage of missing and mismatched end tags in the HTML files, has almost halved, from 6.7% in 1998 to 3.9% in 2002. Second, the average number of HTML tags surrounding a content element has increased by 31%, from 5.1 tags per content element in 1998 to 6.7 tags in 2002. Additionally, the ratio of tag attributes has increased by 26%, from 0.34 attributes per tag in 1998 to 0.43 in 2002.

Although it seems very natural to consider Web pages as trees, the majority of wrapper learning methods treat HTML pages as sequences of tokens where text tokens are interleaved with tags. Information extraction from strings often follows the finite-state methodology, with two alternative approaches seen as the global and the local view of the extraction problem. The local view approach stems from information extraction from unstructured and semi-structured text [5], where a wrapper is an enhancement of a basic HTML parser with a set of extraction rules; an extraction rule often takes the form of delimiters (landmarks) [9; 11], which are sequences of tags preceding (or following) an element to be extracted; for example, the delimiter <td><a> requires a text token to be preceded by tags <td> and <a>.
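As a minimal illustration of how such a string-level delimiter fires on a tokenized page, consider the sketch below. The token representation and the function name are hypothetical, introduced only for this example; the paper itself does not prescribe an implementation.

```python
# A page is modeled as a list of (kind, value) tokens, where kind is
# either "tag" or "text". A delimiter is a sequence of tags that must
# immediately precede a text token for the extraction rule to fire.

def matches_delimiter(tokens, i, delimiter):
    """Return True if the text token at position i is immediately
    preceded by the given sequence of tags (the delimiter)."""
    if tokens[i][0] != "text":
        return False
    j = i - len(delimiter)
    if j < 0:
        return False
    window = tokens[j:i]
    # Every token in the window must be a tag, and the tags must
    # match the delimiter sequence in order.
    return (all(kind == "tag" for kind, _ in window)
            and [value for _, value in window] == list(delimiter))

# Example: the delimiter <td><a> fires on "Item 1" below.
tokens = [("tag", "<tr>"), ("tag", "<td>"), ("tag", "<a>"),
          ("text", "Item 1"), ("tag", "</a>")]
print(matches_delimiter(tokens, 3, ["<td>", "<a>"]))  # True
print(matches_delimiter(tokens, 3, ["<tr>", "<a>"]))  # False
```

The context-less nature of such rules, discussed next, is visible here: the decision depends only on a fixed-length window of tags before the token, not on any earlier labeling decisions.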
The global view approach assumes that HTML pages are instances of an unknown language and attempts to identify this language. In the case of deterministic automata, it determines the automaton structure by generalization from the training examples; in the case of weighted automata/HMMs, it learns the transition probabilities. To accommodate information extraction, these methods either enhance finite-state automata with extraction rules [4] or adopt the formalism of finite-state transducers [2; 7].

The global view approach benefits from grammatical inference methods that can learn finite-state automata and transducers from positive examples; however, they often require many annotated samples to achieve a reasonable generalization. On the other hand, in the local view, using local delimiters in a context-less manner limits the expressive power of delimiter-based wrappers. To combine the advantages of the two approaches, [2] has extended the notion of delimiter to previously labeled text tokens. For example, the delimiter PC(none)<td><a> requires that the current text token be preceded by a text token labeled as none (skipped) and by tags <a> and <td>. As a result, the wrapper learning algorithm