Open Boek: a system for the extraction of numeric data from archeological reports

H. Paijmans and S. Wubben
Tilburg University
RACM Amersfoort
paai@uvt.nl, S.Wubben@uvt.nl

June 20, 2007

Abstract

This paper describes the current state of the Open Boek information retrieval system for natural language archaeological papers and reports in the Dutch language. The system focuses on the recognition of phrases that contain chronological and geographical references. Memory Based Learning is applied to assign the correct classes to such phrases; indexes are then created for later retrieval, and tags can optionally be inserted to trigger appropriate actions such as linking to Google Maps. The paper also describes refinements to the original modus operandi, along with new problems and their solutions.

1 Introduction

In archaeology, as in all other cultural heritage domains, most knowledge is stored in articles and books, i.e. natural language text. Retrieval of facts and information from such texts is notoriously difficult, unless we confine ourselves to keywords. This means that we can search (on Google, for instance) for the string 'middle ages' and retrieve all documents in which those words occur. But Google does not know that 'eleventh century', '1300-1412', 'XIIth century' and countless other chronological expressions would all be relevant for somebody who is interested in the middle ages. In our system, Open Boek, we address these and related problems.

Elsewhere [Paijmans and Wubben2007] we described the principles and constraints that governed our approach to the problem of information retrieval in Dutch archaeological texts. Essentially, we do not try to create or use a general ontology or a similar 'grand design', such as CIDOC/CRM, to interpret the contents of a text. Instead, we try to solve the recognition of each semantic class on its own merits, strive for satisfactory performance, and then go on to the next class.
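To make the keyword problem from the start of this section concrete, the following minimal sketch maps the three chronological expressions quoted there onto year ranges and tests them against a period. It is not the Open Boek method, which relies on Memory Based Learning rather than hand-written rules. The names `parse_period` and `overlaps`, the regular expressions, and the 500-1500 boundaries for the middle ages are all assumptions made for illustration, and the patterns cover only the English forms quoted above, not Dutch text.

```python
import re

# Assumed period boundaries in years AD; illustrative, not taken from the paper.
MIDDLE_AGES = (500, 1500)

ORDINAL_WORDS = [
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth", "twentieth",
]
ORDINALS = {word: i for i, word in enumerate(ORDINAL_WORDS, start=1)}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XII' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (e.g. IX = 9).
        if nxt in ROMAN and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total


def century_to_years(n):
    """Return the first and last year of the n-th century AD."""
    return ((n - 1) * 100 + 1, n * 100)


def parse_period(text):
    """Return (start, end) in years AD, or None if the form is not recognised."""
    text = text.strip()
    m = re.fullmatch(r"(\d{3,4})\s*-\s*(\d{3,4})", text)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = re.fullmatch(r"([IVXLC]+)(?:st|nd|rd|th)\s+century", text)
    if m:
        return century_to_years(roman_to_int(m.group(1)))
    m = re.fullmatch(r"([a-z]+)\s+century", text.lower())
    if m and m.group(1) in ORDINALS:
        return century_to_years(ORDINALS[m.group(1)])
    return None


def overlaps(a, b):
    """True if the closed year ranges a and b share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    for expr in ["eleventh century", "1300-1412", "XIIth century", "middle ages"]:
        years = parse_period(expr)
        relevant = years is not None and overlaps(years, MIDDLE_AGES)
        print(expr, years, relevant)
```

Once each expression is reduced to a year range, a query for a period becomes a simple interval test, whereas a plain keyword search only matches the literal string. The string 'middle ages' itself is deliberately left unrecognised here, which shows how quickly hand-written patterns run out.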
Our choice not to start from a general ontology does not mean that ontologies such as CIDOC/CRM are not useful for structuring data, for imposing a common standard, or even as a tool for extracting data from natural language (NL) documents; see also [Généreux and Niccolucci2006]. At this stage, however, we feel that the community is better served by immediate solutions, which in turn may suggest ways and means for more involved approaches and ontologies. And of course the indexes that we generate can at any time be translated to XML or included in a database.

As a case, let us consider an institution such as the RACM¹, where a large number of papers and reports about archaeological excavations, site surveys and similar documents are stored digitally. Access to the information in these reports is through a collection of separate databases in which relevant attributes of the documents are entered by human operators, through straightforward scanning for keywords, or, sometimes, through a rudimentary keyword index.

Although there is traditionally much activity in the archaeological world in the field of typologies and controlled dictionaries, and although there is an urgent need for so-called 'reference collections' that support such typologies [Lange2004], there is no agreement on how to apply such typologies to information retrieval. This is typical of the archaeological scene. There are projects to create a more involved XML markup for documents based on CIDOC/CRM, e.g. by J. Holmen and his collaborators [Holmen et al.2003], but these do not envisage automatic extraction from instances in the text into the tags, and the current status of the project is not clear.

¹ Rijksdienst voor Archeologie, Cultuurlandschap en Monumenten, the central authority that collects data and monitors archaeological activity in the Netherlands.

The needs of the archaeologist are concisely ex-