Improving the Quality of Degraded Document Images
Ergina Kavallieratou and Efstathios Stamatatos
Dept. of Information and Communication Systems Engineering.
University of the Aegean
83200 – Karlovassi, Greece
{ergina,stamatatos}@aegean.gr
Abstract
Libraries commonly provide public access
to historical and ancient document image collections.
Such document images often require
specialized processing in order to remove background
noise and become more legible. In this paper, we
propose a hybrid binarization approach for improving
the quality of old documents using a combination of
global and local thresholding. First, a global
thresholding technique specifically designed for old
document images is applied to the entire image. Then,
the image areas that still contain background noise are
detected and the same technique is re-applied to each
area separately. Hence, we achieve better adaptability
of the algorithm in cases where various kinds of noise
coexist in different areas of the same image while
avoiding the computational and time cost of applying
local thresholding to the entire image. Evaluation
results based on a collection of historical document
images indicate that the proposed approach is effective
in removing background noise and improving the
quality of degraded documents while documents
already in good condition are not affected.
1. Introduction
Historical and ancient document collections
available in libraries throughout the world are of great
cultural and scientific importance [1-2]. The
transformation of such documents into digital form is
essential for maintaining the quality of the originals
while providing scholars with full access to that
information [3]. It is quite common for such documents
to suffer from degradation problems [4]. To name a
few, the presence of smear and stains, backgrounds with
large variations, uneven illumination, and ink seepage
are factors that impede (and in many cases may destroy)
the legibility of the documents. Therefore, appropriate
filtering methods should be developed in order to
remove noise from historical document images and
improve their quality before libraries expose them to
public view. Within this framework, noise is
considered to be anything that is irrelevant to the textual
information (i.e., foreground) of the document image.
Image analysis systems use binarization as a
standard procedure to convert a grey-scale image to
binary form. An ideal binarization algorithm would be
able to perfectly discriminate foreground from
background, thus removing any kind of noise that
obstructs the legibility of the document image. The
binary image is ideal for further processing [5] (e.g.,
discrimination of printed from handwritten text,
recognition of the contents by applying OCR
techniques, etc.). However, in the framework of a
library collection of historical and ancient documents
intended to be exposed to public view, the document
images in many cases do not need further processing
apart from removing the background noise while leaving
some “traces of time” behind. More importantly, given
such a case, after the removal of background noise, it is
possible for the document images to remain in grey-
scale form. For instance, consider the images of figure
1. Figure 1a shows the original image, figure 1b the
result of the binarization procedure, and figure 1c the
corresponding grey-scale result after removing the
background noise. In the latter case the remaining
noise and the text characters are slightly smoothed.
This has the consequence of making the background
noise practically invisible while the characters are
more legible. Some binarization techniques support
this option [6].
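The difference between a binary result and a grey-scale result can be illustrated with a minimal sketch. It is not the method proposed in this paper: it assumes a single hand-picked threshold, and the function names, the sample pixel values, and the threshold value are all hypothetical. It also does not reproduce the smoothing described above; it only shows that background pixels can be whitened while foreground pixels either collapse to black (binary form) or keep their original grey level (grey-scale form).

```python
# Illustrative sketch only: a fixed global threshold, not the authors' method.

def binarize(image, threshold):
    """Map each pixel to 0 (foreground/text) or 255 (background)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in image]


def remove_background(image, threshold):
    """Whiten background pixels but keep the grey level of foreground pixels."""
    return [[pixel if pixel < threshold else 255 for pixel in row] for row in image]


# Hypothetical 2x4 grey-scale patch: dark strokes (30-60) on a noisy
# light background (170-220). Values range from 0 (black) to 255 (white).
image = [
    [200, 40, 180, 220],
    [170, 60, 30, 210],
]
THRESHOLD = 128  # assumed value, chosen by hand for this toy example

binary = binarize(image, THRESHOLD)
grey = remove_background(image, THRESHOLD)

print(binary)  # [[255, 0, 255, 255], [255, 0, 0, 255]]
print(grey)    # [[255, 40, 255, 255], [255, 60, 30, 255]]
```

In both outputs the uneven background values (170 to 220) become uniform white; only the grey-scale variant preserves the varying intensity of the strokes.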
Traditional binarization approaches can be divided
into two main categories:
1. Global thresholding methods: The pixels of the
image are classified into text or background
according to a global threshold. Usually, such
methods are simple and fast. On the other hand,
they cannot easily adapt when the
background noise is unevenly distributed across the
image (e.g., smear or stains) [7-8].
Proceedings of the Second International Conference on Document Image Analysis for Libraries (DIAL’06)
0-7695-2531-8/06 $20.00 © 2006 IEEE