Taking Entity Reconciliation Offline

Ryan Shaw
School of Information and Library Science
University of North Carolina at Chapel Hill
ryanshaw@unc.edu

Patrick Golden
School of Information and Library Science
University of North Carolina at Chapel Hill
ptgolden@live.unc.edu

ABSTRACT
Entity reconciliation, the linking of names or terms to identifiers in external datasets, is a popular method of adding standardized structured data to loosely structured documents. Most approaches to entity reconciliation rely on remote web services, requiring network access during the reconciliation process. For use cases that rely on a “human in the loop” (reconciling entities during the authoring process), this requirement may be a problem. To address this problem, we investigated the feasibility of offline entity reconciliation against the Virtual International Authority File. Offline entity reconciliation was implemented by taking advantage of newly standardized browser storage interfaces to store and query parts of this large dataset locally. We present the results of this investigation and our comparison of the performance, scalability, ease of implementation, and cross-browser compatibility of the various options for storing entity data locally.

Keywords
Linked data, name authorities, web interfaces.

INTRODUCTION
We first review the typical approaches taken to annotate loosely structured text with structured data. We claim that reconciliation against a database of entities is an attractive approach for many use cases. However, most implementations of reconciliation establish a dependency on web services, making some use cases difficult to support. We examine techniques for breaking this dependency by storing and reconciling against entity data locally. We present the results of a study in which we implemented and tested offline reconciliation using several combinations of operating systems, browsers, and local storage technologies.
The local storage technologies are compared to one another in terms of ease of use, scalability, and performance. We conclude with a discussion of the implications of our study.

STRUCTURED DATA
Information professionals are well-acquainted with the benefits of adding standardized structured data (e.g., metadata) to loosely structured documents. Standardized structured data can bring consistency and interoperability to otherwise inconsistent and idiosyncratic documents, making them amenable to consumption and manipulation through generic tools. Faceted browsing and visualization tools are just two examples.

While structured data can be authored directly using forms, another approach is attractive when authors are willing to re-use another author’s description of an entity (as in shared cataloging), or when there is an external source of structured data about the entities that can be exploited. For example, the restaurants a food blogger reviews are likely to be listed in a directory providing structured data. Medical thesauri will have structured data related to the terms a doctor uses in her notes. A place name gazetteer can provide structured data related to a place name. In all these cases an author need not re-enter this data but can simply reconcile the name or term he used with the external data source. Reconciliation involves an author linking a name or term to an external identifier, thereby disambiguating it and allowing him to gather structured data that others have associated with that identifier (Maali et al., 2011).

Adding structured data to documents via reconciliation against an external data source typically introduces a dependency on web access. For use cases that cannot tolerate sparse or dirty data, and which therefore adopt a “human in the loop” model of reconciling entities during the authoring process, the need to be constantly online may be problematic.
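To make the reconciliation step described above concrete, the following is a minimal sketch of a name lookup against a small in-memory entity list. It is an illustration under stated assumptions, not the implementation studied here: the `Entity` shape, the `normalize`, `buildIndex`, and `candidates` helpers, and the sample names and `example.org` identifiers are all hypothetical.

```typescript
// Hypothetical entity record; the identifiers below are placeholders,
// not real authority records.
interface Entity {
  id: string;         // external identifier
  label: string;      // preferred form of the name
  variants: string[]; // alternative forms of the name
}

type NameIndex = Record<string, Entity[] | undefined>;

// Fold case, punctuation, and spacing so that superficial differences
// between name forms do not block a match.
function normalize(name: string): string {
  return name
    .toLowerCase()
    .replace(/[.,;:'"()-]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Build an index from each normalized name form to the entities that bear it.
function buildIndex(entities: Entity[]): NameIndex {
  const index: NameIndex = Object.create(null);
  for (const entity of entities) {
    for (const name of [entity.label, ...entity.variants]) {
      const key = normalize(name);
      const bucket = index[key] ?? [];
      if (bucket.indexOf(entity) === -1) {
        bucket.push(entity);
      }
      index[key] = bucket;
    }
  }
  return index;
}

// Return every entity whose name matches; an ambiguous name yields several
// candidates, and the author (the "human in the loop") picks one identifier.
function candidates(index: NameIndex, name: string): Entity[] {
  return index[normalize(name)] ?? [];
}

// Made-up sample data: two distinct people who share a name.
const sampleEntities: Entity[] = [
  { id: "http://example.org/entity/1", label: "John Smith", variants: ["Smith, John"] },
  { id: "http://example.org/entity/2", label: "John Smith", variants: ["Smith, John", "Smith, J."] },
  { id: "http://example.org/entity/3", label: "Jane Doe", variants: ["Doe, Jane"] },
];

const sampleIndex = buildIndex(sampleEntities);
const matches = candidates(sampleIndex, "  SMITH, John ");
console.log(matches.map((e) => e.id));
```

In this sketch the ambiguous name returns two candidates, and choosing one of them is what links the text to an identifier. Nothing in the lookup itself requires a network connection; the dependency arises only from where the entity data is stored.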
Consider the doctor making clinical observations in unconnected rural areas, or the historian taking research notes deep in an archive. Is adding structured data via reconciliation during authoring feasible in these offline scenarios?

LOCAL STORAGE TECHNIQUES
The most basic approach to using browser storage for entity reconciliation is to serialize and store an index structure that is deserialized and loaded fully into memory upon page load. In theory, this method could be used to store a small entity index in cookies, but a better approach would be to use the Web Storage API. The Web Storage API (Hickson, 2011), better known as localStorage, enables persistent storage of key-value pairs. It is intended for storing data that should persist across browser sessions and is too large to be stored in cookies. Another option is to use the newer File API, which provides FileSaver (Uhrhane, 2012) and FileReader (Ranganathan, 2012) interfaces that can be

ASIST 2013, November 1-6, 2013, Montreal, Quebec, Canada.