Improving the Quality of Degraded Document Images

Ergina Kavallieratou and Efstathios Stamatatos
Dept. of Information and Communication Systems Engineering, University of the Aegean
83200 – Karlovassi, Greece
{ergina,stamatatos}@aegean.gr

Abstract

Libraries commonly provide public access to collections of historical and ancient document images. Such images often require specialized processing to remove background noise and become more legible. In this paper, we propose a hybrid binarization approach for improving the quality of old documents using a combination of global and local thresholding. First, a global thresholding technique specifically designed for old document images is applied to the entire image. Then, the image areas that still contain background noise are detected and the same technique is re-applied to each area separately. Hence, the algorithm adapts better when various kinds of noise coexist in different areas of the same image, while avoiding the computational and time cost of applying local thresholding to the entire image. Evaluation results based on a collection of historical document images indicate that the proposed approach is effective in removing background noise and improving the quality of degraded documents, while documents already in good condition are not affected.

1. Introduction

Historical and ancient document collections available in libraries throughout the world are of great cultural and scientific importance [1-2]. The transformation of such documents into digital form is essential for maintaining the quality of the originals while providing scholars with full access to that information [3]. It is quite common for such documents to suffer from degradation problems [4]. Smears, stains, strongly varying backgrounds, uneven illumination, and ink seepage, to name a few,
are factors that impair (and in many cases may destroy) the legibility of the documents. Therefore, appropriate filtering methods should be developed in order to remove noise from historical document images and improve their quality before libraries expose them to public view. Within this framework, noise is considered to be anything that is irrelevant to the textual information (i.e., the foreground) of the document image.

Image analysis systems use binarization as a standard procedure to convert a grey-scale image to binary form. An ideal binarization algorithm would perfectly discriminate foreground from background, thus removing any kind of noise that obstructs the legibility of the document image. The binary image is ideal for further processing [5] (e.g., discrimination of printed from handwritten text, recognition of the contents by applying OCR techniques, etc.).

However, in the framework of a library collection of historical and ancient documents intended to be exposed to public view, the document images in many cases need no further processing apart from removing the background noise while leaving some “traces of time” behind. More importantly, in such a case the document images can remain in grey-scale form after the removal of background noise. For instance, consider the images of figure 1. Figure 1a shows the original image, figure 1b the result of the binarization procedure, and figure 1c the corresponding grey-scale result after removing the background noise. In the latter case the remaining noise and the text characters are slightly smoothed, which makes the background noise practically invisible and the characters more legible. Some binarization techniques support this option [6].

Traditional binarization approaches can be divided into two main categories:

1. Global thresholding methods: The pixels of the image are classified into text or background according to a global threshold.
Usually, such methods are simple and fast. On the other hand, they cannot easily adapt when the background noise is unevenly distributed across the image (e.g., smears or stains) [7-8].

Proceedings of the Second International Conference on Document Image Analysis for Libraries (DIAL’06) 0-7695-2531-8/06 $20.00 © 2006 IEEE