Semantic-Based Access to Digital Document Databases

F. Esposito, S. Ferilli, T.M.A. Basile, and N. Di Mauro

Dipartimento di Informatica, University of Bari, Italy
{esposito, ferilli, basile, nicodimauro}@di.uniba.it

Abstract. Discovering significant meta-information from document collections is a critical factor for knowledge distribution and preservation. This paper presents a system that implements intelligent document processing techniques by combining strategies for the layout analysis of electronic documents with incremental first-order learning, in order to automatically classify the documents and their layout components according to their semantics. Indeed, an in-depth analysis of specific layout components can allow the extraction of useful information to improve semantic-based document storage and retrieval. The viability of the proposed approach is confirmed by experiments run in the real-world application domain of scientific papers.

1 Introduction

Since having documents in electronic form makes their management significantly easier, much research in recent years has sought approaches to handle the huge amount of legacy documents in paper format according to the semantics of their components [8]. Conversely, almost all documents nowadays are generated directly in digital format and stored in distributed repositories, whose main concerns and problems are the acquisition and organization of the information contained therein. Manually creating and maintaining an updated index is clearly infeasible, due to the potentially huge amount of data to be handled, tagged and indexed. Hence the strong motivation for research on methods that can automatically acquire new knowledge.
This paper deals with the application of intelligent techniques to the management of a collection of scientific papers on the Internet, aimed at automatically extracting from the documents significant information that is useful to properly store and retrieve them. In this application domain, components such as title, authors, abstract and bibliographic references play an important role in identifying the subject and context of a paper. Three processing stages are typically needed to identify a document's significant components: Layout Analysis, Document Classification and Document Understanding. We propose to exploit Machine Learning techniques to carry out the last two steps. In particular, the need for expressing relations among layout components requires the use of symbolic first-order techniques, while the continuous flow of new documents calls for incremental abilities that can revise faulty knowledge previously acquired.

M.-S. Hacid et al. (Eds.): ISMIS 2005, LNAI 3488, pp. 373–381, 2005. © Springer-Verlag Berlin Heidelberg 2005
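To make the three stages named in the introduction more concrete, the following is a minimal sketch in Python and not the system described in this paper. Every name in it (`Block`, `on_top`, `classify_document`, `understand_document`) is hypothetical. The output of Layout Analysis is hard-coded, and the two rules are written by hand purely for illustration, whereas the paper proposes to learn and revise such rules with incremental first-order learning. The sketch only shows why a relation between layout components, here `on_top`, is a natural ingredient of such rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A layout component, as a (hypothetical) Layout Analysis step might return it."""
    name: str
    kind: str      # "text" or "image"
    y_top: float   # vertical position of the top edge, 0.0 = top of page
    height: float


def on_top(a: Block, b: Block) -> bool:
    """Binary relation between two components: a ends before b starts."""
    return a.y_top + a.height <= b.y_top


def text_blocks(blocks):
    """Text components sorted from the top of the page downwards."""
    return sorted((b for b in blocks if b.kind == "text"), key=lambda b: b.y_top)


def classify_document(blocks) -> str:
    """Document Classification (toy, hand-written rule): a page with at least
    three vertically stacked text blocks is labelled 'scientific_paper'."""
    text = text_blocks(blocks)
    stacked = all(on_top(a, b) for a, b in zip(text, text[1:]))
    return "scientific_paper" if len(text) >= 3 and stacked else "unknown"


def understand_document(blocks, doc_class: str) -> dict:
    """Document Understanding (toy, hand-written rule): label the first three
    text blocks of a scientific paper by their relative order."""
    if doc_class != "scientific_paper":
        return {}
    roles = ["title", "authors", "abstract"]
    return {b.name: role for b, role in zip(text_blocks(blocks), roles)}


# Assumed Layout Analysis output for the first page of a paper.
page = [
    Block("b1", "text", 0.05, 0.05),
    Block("b2", "text", 0.12, 0.04),
    Block("b3", "text", 0.20, 0.15),
    Block("b4", "image", 0.50, 0.30),
]

doc_class = classify_document(page)
labels = understand_document(page, doc_class)
print(doc_class, labels)
```

The labelling step depends on the class assigned first, which mirrors the ordering of the stages: the roles to look for in a page are only meaningful once the kind of document is known.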