Gathering Metadata from Web-Based Repositories of Historical Publications *

Ismael Sanz, Rafael Berlanga and María José Aramburu
Departament d'Informàtica, Campus Penyeta Roja, Universitat Jaume I, E-12071 Castellón, SPAIN
e-mail: {berlanga, aramburu}@inf.uji.es, isanz@guest.uji.es

* This work has been partially funded by the Spanish CICYT project TEL97-1119 and the BANCAIXA foundation.

Abstract

In this paper we examine the problem of extracting schema-conforming metadata from HTML sources. We describe a technique founded on semistructured data analysis, which combines HTML styles, abstracting the visual characteristics of documents, with document-oriented context-free grammars, providing structural information. The technique is flexible enough to be applied not only to individual HTML documents, but also to hyperlinked web structures. This provides an informed, tightly controlled way of navigating the repositories.

1. Introduction

Designing digital libraries has recently become a very active research area that requires the integration of many issues previously analysed in traditional information systems. In particular, physical aspects, such as mass storage and remote access mechanisms, as well as logical aspects, such as the organisation and retrieval of the stored information, are essential for the actual implementation of digital libraries.

In this paper we focus on a special kind of digital library, namely digital libraries of historical documents. Among them, one can identify newspaper, periodical and patent repositories [4][5]. In this context, a historical document is a structured document whose contents are subject to a time period during which they are regarded as up to date. In this work, this period is called the valid time of the document. In such digital libraries, we must distinguish between current and past issues.
The former are those documents whose valid time is still current, whereas the latter are those documents whose contents have expired. Some digital libraries (e.g. newspaper servers) keep only current issues in their document repositories. Another feature of these digital libraries is that documents are published regularly, so there is a strong relationship between the date of publication and the valid time of a document. On the other hand, publications are subject to a set of well-known edition rules, so their structures present considerable regularity.

Nowadays one can find a myriad of general-purpose tools for organising and querying web-based repositories of documents. These tools run the gamut from the simplest web crawlers to the most sophisticated web query languages [1][2][3]. Nevertheless, all of these approaches disregard the main features of our digital libraries: the structural regularity and the temporality of the stored documents.

This paper presents a networked approach to the problem of periodically extracting and maintaining all the relevant information from servers of historical documents. Throughout this paper, such information is called metadata, since it is data that describes and indexes the documents. As a result, the current and historical information included in the documents will be distributed along a collection of networked brokers that will serve user requests.

The rest of the paper is organised as follows. Section 2 presents the proposed architecture for our digital libraries. Section 3 introduces the data model used to represent the metadata extracted from documents. Section 4 presents the implementation of the information gatherers.

[Figure: two digital libraries, A and B, each comprising file systems, a gatherer, an information extraction repository and a local broker, connected through a global broker to web clients.]
Fig. 1: Example of architecture for two Digital Libraries.
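The extraction technique outlined in the abstract, abstracting visual HTML markup into style tokens and then matching the token sequence against a document-oriented grammar, can be illustrated with a minimal sketch. All names here (the tag-to-token mapping, the toy production for an article, the field names) are assumptions for illustration, not the paper's actual rules; a regular approximation of the grammar suffices for this small example.

```python
import re
from html.parser import HTMLParser

# Hypothetical style-to-token mapping: the visual markup of a fragment
# abstracts the role it plays in the publication (an assumption here).
STYLE_TOKENS = {
    "h1": "HEADLINE",
    "i": "BYLINE",
    "p": "BODY",
}

class StyleTokenizer(HTMLParser):
    """Turns an HTML document into a sequence of (token, text) pairs."""
    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags
        self.tokens = []   # extracted (style token, text) pairs

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            token = STYLE_TOKENS.get(self._stack[-1])
            if token:
                self.tokens.append((token, text))

def match_article(tokens):
    """Match the token sequence against the toy production
       article -> HEADLINE BYLINE? BODY+
    and, on success, return schema-conforming metadata."""
    seq = " ".join(tok for tok, _ in tokens)
    if not re.fullmatch(r"HEADLINE( BYLINE)?( BODY)+", seq):
        return None  # the document does not fit the grammar
    meta = {"title": None, "author": None, "body": []}
    for tok, text in tokens:
        if tok == "HEADLINE":
            meta["title"] = text
        elif tok == "BYLINE":
            meta["author"] = text
        else:
            meta["body"].append(text)
    return meta

tokenizer = StyleTokenizer()
tokenizer.feed("<h1>Peace Treaty Signed</h1><i>By J. Smith</i>"
               "<p>First paragraph.</p>")
meta = match_article(tokenizer.tokens)
```

In this sketch the grammar both validates the structure of the document and guides the assignment of text fragments to metadata fields, which is the role the document-oriented grammars play in the technique.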
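The notion of valid time introduced above, and its relationship to the publication date of regularly published documents, can be sketched as follows. The function names, the interval convention and the periodicity table are assumptions for illustration, not the paper's data model.

```python
from datetime import date, timedelta

# Assumed periodicities of the publications a gatherer monitors.
PERIODICITY = {"daily": timedelta(days=1), "weekly": timedelta(weeks=1)}

def valid_time(published, kind):
    """Derive the [start, end) valid-time interval of an issue from its
    publication date and the periodicity of the publication."""
    return published, published + PERIODICITY[kind]

def is_current(published, kind, today=None):
    """A current issue is one whose valid-time interval contains today;
    otherwise it is a past issue whose contents have expired."""
    today = today or date.today()
    start, end = valid_time(published, kind)
    return start <= today < end
```

A broker could use such a predicate to decide whether a request should be answered from the current issues or from the historical part of the repository.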