Towards Web Information Extraction using Extraction Ontologies and (Indirectly) Domain Ontologies

Martin Labský
Univ. of Economics, Prague, Czech Republic
labsky@vse.cz

Marek Nekvasil
Univ. of Economics, Prague, Czech Republic
nekvasim@vse.cz

Vojtěch Svátek
Univ. of Economics, Prague, Czech Republic
svatek@vse.cz

ABSTRACT
Extraction ontologies make it possible to proceed swiftly from initial domain modelling to a running functional prototype of a web information extraction application. We investigate the possibility of semi-automatically deriving extraction ontologies from third-party domain ontologies.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval

1. INTRODUCTION
Most approaches to web information extraction (WIE) deliver extracted information that is only weakly semantically structured from the knowledge engineering viewpoint; a secondary mapping to ontologies is typically needed, which makes the process complicated and possibly error-prone. Approaches based on extraction ontologies (EO) [1], in turn, bring ontologies closer to the actual extraction process by defining the concepts whose instances are to be extracted, in terms of their attributes, the allowed attribute values and higher-level constraints (e.g. cardinality or mutual dependency). EOs are assumed to be hand-crafted based on observation of a sample of resources. They allow the extraction process to start rapidly, as even a very simple EO is likely to cover a sensible part of the target data and to generate meaningful feedback for its own redesign. However, to make maximal use of available data and knowledge, and to avoid overfitting to the few data resources examined by the designer, the whole process must not neglect pre-existing domain ontologies, labelled data and HTML formatting regularities. This is the rationale of our WIE tool under development, called Ex, which combines richly-structured extraction ontologies with inductive and wrapper-based techniques [2]. Here we investigate the reuse of domain ontologies; the structure of EOs will, however, be explained first.

2. EX(TRACTION) ONTOLOGY CONTENT
EOs in Ex are designed to extract occurrences of attributes (such as 'age' or 'surname'), i.e. standalone named entities or values, and occurrences of whole instances of classes (such as 'person') as groups of attributes that 'belong together'.

The mandatory information to be specified for each attribute is its name, data type and dimensionality (e.g. 2 for a computer monitor resolution such as 800x600). Further extraction knowledge related to the attribute value includes: textual value patterns; for numeric types, min/max values, numeric value distribution and units of measure; and min/max value length in tokens or a length distribution. Extraction knowledge about the attribute context includes textual context patterns and formatting constraints. Nesting of attributes is allowed, their course can be specified, and external resources of named entities can be referenced. Additional constraints (such as numerical comparisons) can be specified via JavaScript¹. Finally, HTML formatting constraints may be provided.

Each class definition enumerates the list of attributes and, for each attribute, a cardinality range. Extraction knowledge for class content consists of the a priori probability of each attribute being included as part of a class instance (as opposed to a standalone occurrence), and of class content patterns (such as attribute ordering). Extraction knowledge for class context again consists of textual and HTML formatting patterns.

All types of extraction knowledge yield pieces of evidence indicating the presence of a certain attribute or class instance.
Every piece of evidence may be equipped with two probability estimates: precision and recall; they can be estimated from data or set manually.

¹ ECMAScript, see http://www.mozilla.org/rhino.
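
To make the content of Section 2 more concrete, the following is a minimal sketch of how attribute and class definitions with attached evidence might be modelled. It is written in Python purely for illustration and does not reproduce the actual Ex ontology format or its inference; all names (`Evidence`, `AttributeDef`, `ClassDef`, `accepts`, `cardinality_ok`), the use of regular expressions for value patterns, and the example numbers are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Evidence:
    """A piece of evidence with optional precision/recall estimates."""
    description: str
    precision: Optional[float] = None  # estimated from data or set manually
    recall: Optional[float] = None


@dataclass
class AttributeDef:
    """Mandatory: name, data type, dimensionality. The rest is optional."""
    name: str
    data_type: str                      # e.g. "int", "float", "string"
    dimensionality: int = 1
    value_patterns: List[str] = field(default_factory=list)  # assumed: regexes
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    evidence: List[Evidence] = field(default_factory=list)

    def accepts(self, text: str) -> bool:
        """Check a candidate value against the patterns and numeric range."""
        if self.value_patterns and not any(
            re.fullmatch(p, text) for p in self.value_patterns
        ):
            return False
        if self.data_type in ("int", "float"):
            numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
            if len(numbers) != self.dimensionality:
                return False
            for n in numbers:
                if self.min_value is not None and n < self.min_value:
                    return False
                if self.max_value is not None and n > self.max_value:
                    return False
        return True


@dataclass
class ClassDef:
    """A class lists its attributes with a cardinality range for each."""
    name: str
    # attribute name -> (min, max) occurrences; max=None means unbounded
    cardinalities: Dict[str, Tuple[int, Optional[int]]]
    # attribute name -> a priori probability of belonging to a class instance
    membership_prior: Dict[str, float] = field(default_factory=dict)

    def cardinality_ok(self, counts: Dict[str, int]) -> bool:
        """Check attribute counts of a candidate instance against the ranges."""
        for attr, (lo, hi) in self.cardinalities.items():
            n = counts.get(attr, 0)
            if n < lo or (hi is not None and n > hi):
                return False
        return True


# Hypothetical example definitions (all values are made up).
resolution = AttributeDef(
    name="resolution", data_type="int", dimensionality=2,
    value_patterns=[r"\d+\s*x\s*\d+"], min_value=100, max_value=10000,
    evidence=[Evidence("value pattern NxN", precision=0.9, recall=0.6)],
)
age = AttributeDef(name="age", data_type="int", min_value=0, max_value=130)
person = ClassDef(
    name="person",
    cardinalities={"surname": (1, 1), "age": (0, 1)},
    membership_prior={"surname": 0.9, "age": 0.7},
)
```

Under these assumptions, `resolution.accepts("800x600")` checks both the textual pattern and the numeric range of each of the two dimensions, while `person.cardinality_ok({"surname": 1})` verifies that a candidate instance respects the declared cardinality ranges. How Ex actually combines the precision and recall estimates of individual pieces of evidence is not covered by this sketch.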