Document Style Census for OCR
George Nagy
DocLab, Rensselaer Polytechnic Institute
Troy, NY, USA 12180
nagy@ecse.rpi.edu
Prateek Sarkar
Palo Alto Research Center
Palo Alto, CA, USA 94087
psarkar@parc.com
Abstract
Four methods of converting paper
documents to computer-readable form are
compared with regard to hypothetical labor
cost: keyboarding, omnifont OCR, style-
specific OCR, and style-constrained or style-
adaptive OCR. The best choice is
determined primarily by (1) the reject rates
of the various OCR systems at a given error
rate, (2) the fraction of the material that
must be labeled for training the system, and
(3) the cost of partitioning the material
according to style. For large corpora,
sampling strategies are proposed both for
estimating conversion costs and for taking
advantage of style homogeneity.
1. Introduction
The cost of scanning a collection of paper
documents is fairly predictable, but the cost
of converting the resulting bitmap images to
a searchable form (e.g., by OCR or
keyboarding) is not. We discuss issues
related to estimating the cost of conversion,
including the type of data that must be
collected to plan and execute such a
conversion and the choice of conversion
methodology. We call the selective
collection of relevant data a document
census. As in the case of a demographic
census, both exhaustive enumeration and
sampling are required.
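As a minimal sketch of how a sample drawn during such a census might
feed a cost estimate, the Python fragment below estimates a reject
rate from a sample of characters and plugs it into a simple additive
labor-cost formula. The function names, the additive formula, the
normal-approximation interval, and every numeric value are
illustrative assumptions, not the cost model of this paper.

```python
import math


def estimate_reject_rate(sample_rejects, sample_chars, z=1.96):
    """Estimate the reject rate from a random sample of characters.

    Returns (point_estimate, half_width), where half_width is the
    half-width of a normal-approximation confidence interval.
    z = 1.96 corresponds to roughly 95% confidence (an assumption).
    """
    if sample_chars <= 0:
        raise ValueError("sample_chars must be positive")
    p = sample_rejects / sample_chars
    half_width = z * math.sqrt(p * (1.0 - p) / sample_chars)
    return p, half_width


def conversion_cost(total_chars, reject_rate, labeled_fraction,
                    key_cost, partition_cost=0.0):
    """Hypothetical labor cost of converting a corpus.

    Assumes that rejected characters and training labels are both
    keyed by hand at key_cost per character, and that partitioning
    by style adds a fixed partition_cost.
    """
    correction = total_chars * reject_rate * key_cost
    labeling = total_chars * labeled_fraction * key_cost
    return correction + labeling + partition_cost


if __name__ == "__main__":
    # Invented numbers: 50 rejects observed among 1000 sampled characters.
    rate, half = estimate_reject_rate(50, 1000)
    low = conversion_cost(1_000_000, max(rate - half, 0.0), 0.01, 1.0, 500.0)
    high = conversion_cost(1_000_000, rate + half, 0.01, 1.0, 500.0)
    print(f"reject rate {rate:.3f} +/- {half:.3f}")
    print(f"estimated cost between {low:.0f} and {high:.0f} units")
```

Reporting a range rather than a single figure reflects the fact that
the reject rate is known only from a sample; a larger sample narrows
the range at the price of more labeling effort.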
The conversion of documents to digital form
is of interest for increasing accessibility to
both content and form. Broad access to the
latter is the primary purpose of document
preservation, which usually targets
relatively few but precious documents. The
problem of preserving important historical
documents is similar to that of producing
high-quality facsimile editions. Even when
this is accomplished with digital scanners or
digital cameras, as opposed to analog
reproduction (e.g., film photography), the
production of a transcription is secondary to
the retention of precise image detail. (Of
course, even the most elaborate facsimile
cannot preserve some aspects of the original,
such as paper thickness or ink composition.) We
consider here only documents where the
content is of interest, rather than the
document artifact itself. Document format is
another matter: it is generally agreed that at
least some format information must usually
be retained for effective access to content.
In many collections, the separation of text
from illustrations is difficult. Because we are
concerned here primarily with the OCR
aspects of document conversion, we
concentrate on documents that are mainly
text and assume that text-graphics
separation is accomplished by automated
algorithms combined with human
interaction. We also neglect
contextual methods based on language or
application-specific models, which can
greatly benefit all the methods discussed
herein.
2. Specialized document collections
The vast majority of all documents on the
web are freely accessible. A library can also
offer its clients some exclusive collections,
Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL’04)
0-7695-2088-X/04 $20.00 © 2004 IEEE