HarvANA – Harvesting Community Tags to Enrich Collection Metadata

Jane Hunter, Imran Khan, Anna Gerber
University of Queensland
St Lucia, Queensland, Australia
(617) 33654311
{jane, imrank, agerber}@itee.uq.edu.au

ABSTRACT
Collaborative, social tagging and annotation systems have exploded on the Internet as part of the Web 2.0 phenomenon. Systems such as Flickr, Del.icio.us, Technorati, Connotea and LibraryThing provide a community-driven approach to classifying information and resources on the Web, so that they can be browsed, discovered and re-used. Although social tagging sites provide simple, user-relevant tags, there are issues associated with the quality of the metadata and the scalability compared with conventional indexing systems. In this paper we propose a hybrid approach that enables authoritative metadata generated by traditional cataloguing methods to be merged with community annotations and tags. The HarvANA (Harvesting and Aggregating Networked Annotations) system uses a standardized but extensible RDF model for representing the annotations/tags and OAI-PMH to harvest the annotations/tags from distributed community servers. The harvested annotations are aggregated with the authoritative metadata in a centralized metadata store. This streamlined, interoperable, scalable approach enables libraries, archives and repositories to leverage community enthusiasm for tagging and annotation, augment their metadata and enhance their discovery services. This paper describes the HarvANA system and its evaluation through a collaborative testbed with the National Library of Australia using architectural images from PictureAustralia.
Categories and Subject Descriptors
H.3.5 [Online Information Services]: Web-based services
H.3.1 [Content Analysis and Indexing]: Indexing methods
H.3.7 [Digital Libraries]: Dissemination, User issues

General Terms
Performance, Design, Standardization

Keywords
Social Tagging, Annotation, Harvesting, Metadata, Digital Collections, Ontology, Folksonomy

1. INTRODUCTION
Over the past few years, collaborative tagging and annotation systems that involve communities of users creating and sharing their own metadata have exploded on the Internet. Sites such as Flickr [1], Del.icio.us [2], Connotea [3] and LibraryThing [4] are considered exemplary of the Web 2.0 phenomenon [5] because they use the Internet to harness collective intelligence. Such systems provide a community-driven, “organic” approach to classifying information and resources on the Web, so that they can be browsed, discovered and re-used.

Proponents of social tagging systems [6-8] claim that because the terms used to describe the resources are community-defined, they are more topical, adaptive and relevant to users than traditional library cataloguing systems that use complex, relatively fixed, hierarchical thesauri and authority files. Terms in such controlled vocabularies do not evolve with popular language and many of them are irrelevant or anachronistic. Searches by non-experts often fail to yield results that are relevant or that the users expect or understand. Authoritative metadata is also very expensive as it requires the time and effort of expert cataloguers. Social tagging and community annotation systems, on the other hand, offer a mechanism by which the time-consuming and expensive task of metadata generation can be distributed across communities. It is also argued that they provide a better measure of usefulness than software-based systems (e.g., Google) that rank resources based on the number of external links that point to a resource.
However, recent analyses of tagging data [9, 10, 51] have shown that the indexing terms input by untrained users are often inconsistent and inaccurate, causing documents to go undiscovered or to be discovered in the wrong category. There is also the problem of scalability. It is difficult to predict how Flickr, del.icio.us, and other folksonomy-dependent sites will scale as content volume escalates. The flat structure of “folksonomies” [11] (long lists of simple tags) is useful for serendipitous browsing, but it does not support more sophisticated searching and browsing over very large collections. Folksonomies will not organically evolve into synonymous clusters, identify preferred terms, or accrue into broader and narrower terms, all of which are features supported by thesauri and ontologies.

Finally, a major limitation of social tagging systems is their lack of interoperability. Many of the popular social tagging systems are centralized, non-interoperable with other systems, do not support multiple levels of sharing and generally do not employ standards. Initiatives such as TagCommons [12] are investigating mechanisms and open standards to improve the interoperability of these popular community tools so tags can be shared across communities and resource types.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL’08, June 16–20, 2008, Pittsburgh, Pennsylvania, USA.
Copyright 2008 ACM 978-1-59593-998-2/08/06...$5.00.