Real Understanding of Real Estate Forms *

Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD
firstname.lastname@comlab.ox.ac.uk

ABSTRACT
Finding an apartment is a lengthy and tedious process. Even after deciding, one can never be sure of not having missed an even better offer that was just one click away. Form understanding is key to automatically accessing and processing all the relevant, and nowadays readily available, data. We introduce opal (ontology-based web pattern analysis with logic), a novel, purely logical approach to web form understanding: opal labels, structures, and groups form fields according to a domain-specific ontology linked through phenomenological rules to a logical representation of a DOM. The phenomenological rules describe how ontological concepts appear on the web; the ontology formalizes and structures common patterns of web pages observed in a domain. A unique feature of opal is that all domain-independent assumptions about web forms are represented in rules, whereas domain-specific assumptions are represented in the ontology. This yields a coherent logical framework that is robust in the face of changing web trends. We apply opal to a significant, randomly selected sample of UK real estate sites, showing that straightforward rules suffice to achieve high-precision form understanding.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: On-line Information Services - Web-based services

General Terms
Languages, Experimentation

Keywords
form understanding, data extraction, deep web, web page phenomenology

* The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 246858.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WIMS’11, May 25-27, 2011, Sogndal, Norway.
Copyright 2011 ACM 978-1-4503-0148-0/11/05 ...$10.00.

1. INTRODUCTION
On the web today there are as many pages as there are stars in the Milky Way! Through observation and analysis we have identified patterns of genesis and behavior that allow us to categorize stars and determine their properties, despite the distances involved. Unfortunately, for web pages the same principled analysis is still in its infancy.

In this paper, we focus on the particular problem of web form understanding. Web forms are one of the more deeply investigated aspects of the Web. Their understanding is crucial for “deep-web” search engines, for web data extraction, and for web querying. By “understanding” we primarily mean the identification of forms and form elements, along with their logical organization beyond the asserted HTML structure.

Approaches to form understanding in the context of deep web search [11, 15, 9, 14], web querying [16, 2], and web extraction [13] have focused on observing commonalities of general web forms and exploiting them in specifically tailored algorithms and heuristics. Despite reportedly good performance, two issues seriously limit their applicability in practice: (1) In all the above approaches, the necessary assumptions are hard-coded into the implemented algorithms, and it is difficult, or even impossible, to adapt them. Furthermore, in many cases mere parametrization of the heuristics does not provide the needed adaptability, especially in an open scenario such as the Web.
(2) Defining general heuristics capable of producing highly precise results in all domains is not an easy task. By generalizing the assumptions made about web forms, we are automatically forced to ignore domain-specific patterns that can make a real difference in form understanding for entire classes of web sites.

In this paper, we introduce opal, short for ontology-based web pattern analysis with logic. opal uses Prolog rules to explicitly represent assumptions about commonalities of web forms (and other types of web objects). Thus opal allows (1) the declarative definition of the needed assumptions and heuristics through Prolog rules, (2) the specification of multiple sets of rules to be chosen and applied in order to adapt to the situation at hand, and (3) the easy integration of background knowledge (e.g., about the domain, the patterns of web forms, the vocabulary used). We have implemented a prototype system for analyzing real-estate forms in the UK that exploits background knowledge on the domain (e.g., to distinguish forms for renting and buying properties) and adapts to the observed form type by using different assumptions. We show that an encoding of those assumptions as