Enabling Ontology-based Document Classification and Management in ebXML Registries

Alessio Bechini
University of Pisa
Via Diotisalvi, 2
56122 Pisa
+390502217554
alessio.bechini@iet.unipi.it

Andrea Tomasi
University of Pisa
Via Diotisalvi, 2
56122 Pisa
+390502217560
andrea.tomasi@iet.unipi.it

Jacopo Viotto
University of Pisa
Via Diotisalvi, 2
56122 Pisa
+390502217467
jacopo.viotto@iet.unipi.it

ABSTRACT
Document Management Systems (DMSs) are a key component in modern enterprises. For successful document search and retrieval, an adequate metadata set should be defined in order to describe documents in sufficient detail. However, a single metadata set is often not sufficient throughout the whole DMS, as different document types require different attributes to be properly characterized. In this paper, we introduce ontologies as a modeling technology for structured metadata definition within DMSs. Focusing on the ebXML registry standard, we show an approach to enhance DMSs for semantic content management, and then we propose a method to exploit this new capability for automated document characterization.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – retrieval models, search process
I.7.1 [Document and Text Processing]: Document and Text Editing – document management
J.1 [Administrative Data Processing]: Business, Government

General Terms
Algorithms, Management.

Keywords
Document management system, ebXML registry, ontology.

1. INTRODUCTION
It is widely understood that efficient management of enterprise knowledge can noticeably boost the productivity of a modern company. The overwhelming number of information sources currently available frequently becomes a problem, rather than an opportunity. Companies need a well-organized knowledge repository in order to reuse information throughout the enterprise, and consequently reduce costs and response times.
Document Management Systems (DMSs) are the standard solution to knowledge management requirements within enterprises. A typical DMS consists of a repository containing the actual documents, and an engine working on top of it, offering functions for storing, searching and retrieving documents, as well as advanced features such as versioning and access control. Documents are annotated with metadata, describing properties like author, format, keywords and so on. Metadata are heavily used in document searches (much more frequently than content, because metadata are less tied to syntax and closer to semantics), and since searching is the most important function in this kind of system, it becomes clear that efficient knowledge management can be enabled by a well-structured metadata model.

Unfortunately, no standard schema currently exists for satisfactory document characterization. The most notable effort, the well-known Dublin Core metadata set [10][32], was explicitly designed to be minimal, in order to suit a wide range of applications. The Dublin Core schema does not accommodate the creation of complex user-defined structures, and thus lacks the flexibility often required in a number of application fields. This need for complexity arises when observing that different document types need different metadata sets to be fully qualified: for instance, an internal report may reference the originating department, but this latter piece of information would be meaningless in relation to a newspaper article. This problem has grown even worse in recent years, due to the current trend of using DMSs as generic Content Management Systems (CMSs), containing not only textual documents but multimedia files as well.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’08, March 16-20, 2008, Fortaleza, Ceará, Brazil. Copyright 2008 ACM 978-1-59593-753-7/08/0003…$5.00.

2. EVOLUTION OF DMSs AND ONTOLOGIES
Since we cannot define a straightforward “one-size-fits-all” metadata set, we need a more complex model to enable DMSs to manage metadata in a structured way, such as classifying similar documents using the same properties and taking advantage of hierarchies to express different levels of detail and abstraction. The description of entities, properties and their relations is exactly the focus of research about ontologies and related technologies.
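The idea of describing similar documents with the same properties, and of arranging document types in a hierarchy, can be illustrated with a minimal sketch. It is plain Python rather than an ontology language or the ebXML registry interface, and every name in it (`DocumentType`, `all_properties`, `unknown_keys`, the `newspaper` property, the sample metadata values) is a hypothetical choice made for illustration only. It reuses the example of the internal report and the newspaper article: both inherit generic properties from a common parent type, while `department` is meaningful only for the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentType:
    """A node in a hypothetical hierarchy of document types."""
    name: str
    properties: frozenset                     # properties declared by this type
    parent: Optional["DocumentType"] = None   # more general type, if any

    def all_properties(self) -> frozenset:
        """Own properties plus everything inherited from ancestor types."""
        inherited = self.parent.all_properties() if self.parent else frozenset()
        return inherited | self.properties

    def unknown_keys(self, metadata: dict) -> set:
        """Metadata keys that are not meaningful for this document type."""
        return set(metadata) - self.all_properties()


# Generic properties shared by every document (illustrative selection).
DOCUMENT = DocumentType("Document", frozenset({"author", "format", "keywords"}))

# Specialised types add their own properties on top of the inherited ones.
INTERNAL_REPORT = DocumentType("InternalReport", frozenset({"department"}), DOCUMENT)
NEWSPAPER_ARTICLE = DocumentType("NewspaperArticle", frozenset({"newspaper"}), DOCUMENT)

# Made-up sample annotation for an internal report.
report_metadata = {"author": "Example Author", "department": "Example Department"}

print(INTERNAL_REPORT.unknown_keys(report_metadata))    # nothing is flagged
print(NEWSPAPER_ARTICLE.unknown_keys(report_metadata))  # "department" is flagged
```

A flat, single metadata set would have to either include `department` for every document or drop it for all of them; the hierarchy lets each type carry exactly the properties that make sense for it.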