Document Style Census for OCR

George Nagy
DocLab, Rensselaer Polytechnic Institute
Troy, NY, USA 12180
nagy@ecse.rpi.edu

Prateek Sarkar
Palo Alto Research Center
Palo Alto, CA, USA 94087
psarkar@parc.com

Abstract

Four methods of converting paper documents to computer-readable form are compared with regard to hypothetical labor cost: keyboarding, omnifont OCR, style-specific OCR, and style-constrained or style-adaptive OCR. The best choice is determined primarily by (1) the reject rates of the various OCR systems at a given error rate, (2) the fraction of the material that must be labeled for training the system, and (3) the cost of partitioning the material according to style. For large corpora, sampling strategies are proposed both for estimating conversion costs and for taking advantage of style homogeneity.

1. Introduction

The cost of scanning a collection of paper documents is fairly predictable, but the cost of converting the resulting bitmap images to a searchable form (i.e., by OCR and keyboarding) is not. We discuss issues related to estimating the cost of conversion, including the type of data that must be collected to plan and execute such a conversion and the choice of conversion methodology. We call the selective collection of relevant data a document census. As in the case of a demographic census, both exhaustive enumeration and sampling are required.

The conversion of documents to digital form is of interest for increasing accessibility to both content and form. Broad access to the latter is the primary purpose of document preservation, which usually targets relatively few but precious documents. The problem of preserving important historical documents is similar to that of producing high-quality facsimile editions. Even when this is accomplished with digital scanners or digital cameras, as opposed to analog reproduction (i.e., film photography), the production of a transcription is secondary to the retention of precise image detail.
(Of course, however elaborate, facsimiles cannot be expected to preserve some aspects, like paper thickness or ink composition.) We consider here only documents where the content is of interest, rather than the document artifact itself. Document format is another matter: it is generally agreed that at least some format information must usually be retained for effective access to content.

In many collections, the separation of text from illustrations is difficult. Since we are now investigating primarily the OCR aspects of document conversion, we concentrate on mainly-text documents, and assume that automated algorithms combined with human interaction accomplish text-graphics separation. We also neglect contextual methods based on language or application-specific models, which can greatly benefit all the methods discussed herein.

2. Specialized document collections

The vast majority of all documents on the web are freely accessible. A library can also offer its clients some exclusive collections,

Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL’04) 0-7695-2088-X/04 $20.00 © 2004 IEEE