Restoration of Archival Documents Using a Wavelet Technique

Chew Lim Tan, Member, IEEE, Ruini Cao, and Peiyi Shen

Abstract—This paper addresses the problem of restoring handwritten archival documents by recovering their contents from the interfering handwriting on the reverse side caused by the seeping of ink. We present a novel method that works by first matching both sides of a document such that the interfering strokes are mapped to the corresponding strokes originating from the reverse side. This facilitates the identification of the foreground and interfering strokes. A wavelet reconstruction process then iteratively enhances the foreground strokes and smears the interfering strokes so as to strengthen the discriminating capability of an improved Canny edge detector against the interfering strokes. The method has been shown to restore the documents effectively, with average precision and recall rates for foreground text extraction of 84 percent and 96 percent, respectively.

Index Terms—Document image analysis, wavelet enhancement, wavelet smearing, Canny edge detector, text extraction, image segmentation, bleed-through, show-through, noise cancellation, denoising.

1 INTRODUCTION

An important task in document image analysis is text segmentation [1], [2]. However, this paper introduces a rather different problem of text segmentation, that is, how to extract clear text strings from interfering images originating from the reverse side. The motivation for this research arises from a request from the National Archives of Singapore to restore the appearance of their handwritten archival documents, which have been kept over long periods of time during which the seeping of ink has resulted in double images, as shown in Fig. 1. Given this problem, we have to separate three classes of objects: the foreground text, interfering strokes from the reverse side, and the background. Usually, the foreground writing appears darker than the interfering strokes.
However, there are cases where the foreground and interfering strokes have similar intensities or, worse still, the interfering strokes are more prominent than the foreground.

Many segmentation and binarization approaches have been reported in the literature [2], [3], [4], [5], [6], [7]. These methods aim to extract clear text from either a noisy or a textured background. However, they deal with one-sided documents, and most of them assume separable gray-scale ranges and/or distinctive features between the foreground and background.

Similar work can be seen in solving the “show-through” problem in scanning duplex printed documents. Don’s work segments the double-sided images based on the isolated gray-scale range of interfering images and the noise characteristics [8]. Sharma [9], [10] develops a unique scanning model and an adaptive linear-filtering scheme for removal of show-through using both sides of the document. Our problem, however, is different from show-through in that the interfering strokes are the result of “bleed-through” due to anisotropic absorption and spreading in the paper. As such, corresponding images on the two sides may not match completely, as they do in the “show-through” situation.

Related work is seen in the denoising techniques presented by Donoho [11], [12], which are based on thresholding and shrinking empirical wavelet coefficients for recovering and/or denoising signals. Berkner et al. [13] propose a wavelet-based approach to sharpening and smoothing of images for use in deblurring or denoising. Our problem again differs from these works in that the interfering strokes are not really noise but rather distinctive images in their own right that we want to distinguish and remove.

Finally, Lu et al. [14] and Lu [15] present a similar wavelet method that decreases the edge contrast and smears the direct components of the edges with their neighboring pixels.
Though this edge-based wavelet image preprocessing method can handle the change of feature coefficients (local maxima) [16], [17], [18], it is still inadequate for our present problem: 1) Due to the “bleed-through” problem discussed earlier, the edge positions of the interfering strokes differ from those of their original strokes on the reverse side. 2) As a result, any mismatch between the interfering strokes observed on the front and their original strokes on the reverse side will cause interfering strokes to be mistaken for foreground edges.

In view of the above, we propose a new method which differs from others in the following aspects. 1) We develop an improved Canny edge detector to suppress unwanted interfering strokes [19], [20]. 2) We process the image by using wavelet enhancing and smearing operations to work on the foreground and interfering strokes, respectively. 3) We adopt a set of wavelet enhancement and smearing coefficients in different scales instead of the traditional local maxima reconstruction method.

2 PROPOSED METHOD

2.1 Image Matching and Overlay

It is natural that weak foreground strokes may not necessarily seep into the reverse side (Fig. 1d). On the other hand, interfering strokes must have originated from strong foreground strokes on the reverse side. Thus, we match both images from either side of a page by hand. Let f(x, y) denote the k bits per pixel gray-scale front image, and b(x, y) the reverse side image of the same page, where x and y represent the row and the column, respectively. The two images have the same dimension M × N. An overlay operation is carried out as follows:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 10, OCTOBER 2002 1399

Fig. 1. Sample images: (a) sample1, front side, (b) sample1, reverse side, (c) sample2, front side, and (d) sample2, reverse side.

C.L. Tan is with the School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543. E-mail: tancl@comp.nus.edu.sg.
R. Cao is with Hotcard Technology Pte Ltd., 2 Jurong East Street 21, #05-30 IMM Building, Singapore 609601. E-mail: caorn@hotcardtech.com.
P. Shen is with the Communication Solution Group, Agilent Technologies, Yishun Ave. 7, No. 1, Singapore 768923. E-mail: pei-yi_shen@aglient.com.

Manuscript received 27 Dec. 2000; revised 17 Aug. 2001; accepted 10 May 2002. Recommended for acceptance by L. Vincent. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number 113366.

0162-8828/02/$17.00 © 2002 IEEE
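The notation of Section 2.1 (a front image f and a reverse image b, both k bits per pixel and of the same dimension M × N) can be illustrated with a minimal sketch. It only checks that a pair of scans satisfies those stated conditions; it does not perform the matching, which the paper does by hand, and it does not implement the overlay operation. The function name prepare_page_pair, the default k = 8, and the left-to-right mirroring of the reverse-side scan are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def prepare_page_pair(front, reverse, k=8):
    """Return (f, b) as same-sized 2-D gray-scale arrays.

    Assumptions (not from the paper): the reverse-side scan is a mirror
    image of the front side, so it is flipped left-to-right here, and
    k defaults to 8 bits per pixel.
    """
    f = np.asarray(front)
    b = np.asarray(reverse)

    # Both inputs must be single-channel images indexed as [row x, column y].
    if f.ndim != 2 or b.ndim != 2:
        raise ValueError("expected single-channel gray-scale images")

    # Hypothetical mirroring step so that column y refers to the same
    # physical location on both sides of the page.
    b = np.fliplr(b)

    # The two images must share the same dimension M x N.
    if f.shape != b.shape:
        raise ValueError(f"dimension mismatch: {f.shape} vs {b.shape}")

    # Pixel values must fit in k bits, i.e. lie in [0, 2**k - 1].
    max_level = 2 ** k - 1
    for name, img in (("front", f), ("reverse", b)):
        if img.min() < 0 or img.max() > max_level:
            raise ValueError(f"{name} image is outside the {k}-bit range")

    return f, b


# Usage with two tiny synthetic "scans".
front = np.array([[10, 200], [30, 40]])
reverse = np.array([[1, 2], [3, 4]])
f, b = prepare_page_pair(front, reverse)
```

If the reverse side has already been mirrored during scanning, the flip should be dropped; the dimension and range checks apply either way.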