Automatic Generation of Character Groundtruth for Scanned Documents: A Closed-Loop Approach*

Tapas Kanungo
Caere Corporation
100 Cooper Court
Los Gatos, CA 95030, USA
tapas@caere.com

Robert M. Haralick
Department of Electrical Engineering
University of Washington
Seattle, WA 98195, USA
haralick@ee.washington.edu

Abstract

Character groundtruth for scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not possible because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive.

In this paper we present a closed-loop methodology for collecting very accurate (within a pixel error) groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied and scanned. A registration algorithm estimates the geometric transformation that registers the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transform to create the groundtruth for the scanned document image.

This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. The cost of creating groundtruth using our methodology is minimal. We use this methodology to groundtruth 33 English documents consisting of over 62000 symbols. The procedure takes approximately 5 minutes to groundtruth each page on a SUN Sparc 10.
Furthermore, we use the method to groundtruth Hindi and FAX documents without any modification to our procedure. Our software will be made available to researchers shortly.

Keywords: Groundtruth, document analysis, performance evaluation, registration, geometric transformations, image warping, FAX.

*This work was done when Kanungo was at the University of Washington.

1 Introduction

Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not possible because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive.

In this paper we give a closed-loop methodology for collecting very accurate (within a pixel error) groundtruth for scanned documents. The groundtruth generated by this method, besides being directly useful for evaluating the performance of OCR systems, is crucial for validating document degradation models.

We are unaware of any literature that uses a method similar to ours for automatically collecting groundtruth. However, a lot of work on document registration has been reported in the past. Most of this literature pertains to the problem where an ideal form has to be registered to a scanned, hand-filled form. The general idea is to extract the information filled in by a human in the various fields of the form. A common method is to extract features from the scanned forms and match them to the features in the ideal form [2, 1]. Unfortunately, we cannot use this body of work since there are no universal landmarks that appear in each type of document.
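The final step of the methodology, in which groundtruth attached to the ideal document image is mapped through the estimated geometric transformation, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the estimated transformation is available as a 3x3 projective matrix, which this section does not specify, and it assumes that a transformed character box is reported as the axis-aligned box enclosing its four mapped corners. All names (apply_transform, transform_box, transform_groundtruth) are hypothetical.

```python
from typing import List, Tuple

# Assumption: the registration step yields a 3x3 projective matrix.
Matrix = List[List[float]]
# Assumption: a box is stored as (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def apply_transform(T: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map an ideal-image point (x, y) into scanned-image coordinates."""
    w = T[2][0] * x + T[2][1] * y + T[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this transform")
    u = (T[0][0] * x + T[0][1] * y + T[0][2]) / w
    v = (T[1][0] * x + T[1][1] * y + T[1][2]) / w
    return u, v


def transform_box(T: Matrix, box: Box) -> Box:
    """Map the four corners of a box and return their enclosing box."""
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    mapped = [apply_transform(T, x, y) for x, y in corners]
    us = [p[0] for p in mapped]
    vs = [p[1] for p in mapped]
    return min(us), min(vs), max(us), max(vs)


def transform_groundtruth(
    T: Matrix, records: List[Tuple[str, Box]]
) -> List[Tuple[str, Box]]:
    """Carry each (symbol, box) record from the ideal to the scanned image."""
    return [(symbol, transform_box(T, box)) for symbol, box in records]


if __name__ == "__main__":
    # Hypothetical transform: a pure shift of +5 pixels in x and -3 in y.
    shift = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
    ideal = [("a", (10.0, 20.0, 18.0, 32.0))]
    print(transform_groundtruth(shift, ideal))
```

Taking the enclosing box of the mapped corners keeps the output in the same axis-aligned form as the input, at the cost of slightly enlarging boxes when the transformation includes rotation or skew.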
2 Document groundtruth

Groundtruth information is essential for evaluating any document understanding system. By groundtruth we mean the correct location, size, font type, and bounding box of the individual symbols on the document image. More global groundtruth associated with a document image could include layout information (such as zone bounding boxes demarcating individual words, paragraphs, article and section titles, addresses, footnotes, table and figure captions, etc.) and style information (such as one column or two columns; right justified or not; etc.). The groundtruth information, of course, needs to be 100 percent accurate, otherwise the systems being evaluated will be penalized incorrectly. Having such groundtruth allows a researcher to study which factors affect the algorithm’s [8, 6].

1015-4651/96 $5.00 © 1996 IEEE
Proceedings of ICPR ’96