Enhanced Text Extraction from Arabic Degraded Document Images using EM Algorithm
Wafa Boussellaa (1), Aymen Bougacha (1), Abderrazak Zahour (2), Haikal EL Abed (3), Adel Alimi (1)

(1) University of Sfax, REGIM, ENIS, Route Soukra, BPW, 3038, Sfax, Tunisia
(2) IUT, Université du Havre, Place Robert Schuman, 76610 Le Havre, France
(3) Technical University Braunschweig, Institute for Communication Technology (IfN), Germany

Email: {wafa.boussellaa,adel.alimi,elabed}@ieee.org, Aymen.Bougacha@gmail.com, abderrazak.zahour@univ-lehavre.fr
Abstract
This paper presents a new enhanced text extraction algorithm for degraded document images, based on probabilistic models. The observed document image is modeled as a mixture of Gaussian densities representing the foreground and background components of the document. The EM algorithm is introduced in order to estimate and refine the parameters of the mixture of densities iteratively. The initial parameters of the EM algorithm are estimated by the k-means clustering method. After parameter estimation, the document image is partitioned into text and background classes by means of the ML approach. The performance of the proposed approach is evaluated on a variety of degraded documents from the collections of the National Library of Tunisia.
1. Introduction
The automatic processing of degraded historical documents is a challenge in the field of document image analysis, which is confronted with many difficulties due to storage conditions and the complexity of document content. For historical, degraded, and poor-quality documents, enhancement is not an easy task. The main purpose of an enhancement step for historical documents is to remove information coming from the background. Background artifacts can derive from many kinds of degradation, such as scan optical blur and noise, spots, underwriting, or overwriting. Most previous document image enhancement algorithms have been designed primarily for binarization. Binarization aims to extract text from distorted, degraded documents, and its related methods are proposed for processing grayscale documents and have not been extended to color documents.
In this paper, an enhanced text extraction method is proposed on the basis of maximum likelihood (ML) estimation for the segmentation problem. The difficult task lies in how to estimate the parameters of the likelihood functions and the number of segments. Accordingly, the expectation-maximization (EM) algorithm is introduced in order to improve the parameter estimation. The initial estimates for the EM algorithm are given by the k-means clustering algorithm to avoid the problem of random initialization. Text and background segmentation is then performed by the conventional ML method. The segmented image is used to produce a final colored restored document image.
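As a concrete illustration of this pipeline (a minimal sketch, not the authors' implementation), the steps can be shown on 1-D pixel gray levels: k-means seeds a two-component Gaussian mixture, EM alternates E- and M-steps to refine the weights, means, and variances, and each pixel is then assigned to the more likely component by an ML decision. The function names and the restriction to two 1-D components are assumptions made for illustration.

```python
import math

def kmeans_init(xs):
    # Two-cluster 1-D k-means to seed the EM parameters (illustrative helper).
    c0, c1 = min(xs), max(xs)
    for _ in range(20):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return (c0, g0), (c1, g1)

def gauss(x, mu, var):
    # Gaussian density value at x.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    # Fit a two-component 1-D Gaussian mixture by EM, seeded by k-means.
    (c0, g0), (c1, g1) = kmeans_init(xs)
    w = [len(g0) / len(xs), len(g1) / len(xs)]
    mu = [c0, c1]
    var = [max(sum((x - c0) ** 2 for x in g0) / len(g0), 1e-6),
           max(sum((x - c1) ** 2 for x in g1) / len(g1), 1e-6)]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each pixel.
        r = []
        for x in xs:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            r.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for k in (0, 1):
            rk = sum(ri[k] for ri in r)
            w[k] = rk / len(xs)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, xs)) / rk
            var[k] = max(sum(ri[k] * (x - mu[k]) ** 2
                             for ri, x in zip(r, xs)) / rk, 1e-6)
    return w, mu, var

def classify(xs, w, mu, var):
    # ML decision: assign each pixel to the component with higher likelihood.
    return [0 if w[0] * gauss(x, mu[0], var[0]) >= w[1] * gauss(x, mu[1], var[1])
            else 1
            for x in xs]
```

On gray levels drawn from a dark text mode and a bright background mode, the two estimated means settle near the two modes, and the ML decision separates text pixels from background pixels.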
The rest of this paper is organized as follows. Section 2 gives an overview of previous work on degraded document image enhancement. Sections 3 and 4 describe the proposed algorithm in detail. Experimental results are presented in Section 5. Conclusions and future work are given in the last section.
2. Related work
According to the literature, approaches that deal with document image enhancement and text extraction are based on binarization or foreground/background separation techniques. Most previous document image enhancement algorithms have been designed primarily for binarization. Binarization is performed either globally or locally. Global thresholding methods are not sufficient, since document images are usually degraded by shadows, non-uniform illumination, and low contrast. Local methods are shown to perform better according to a recent exhaustive survey of image binarization methods presented in [15]. Two main approaches are distinguished: local thresholding-based methods and clustering-based methods.
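One widely cited local thresholding method of this kind is Sauvola's, shown here as a minimal sketch (an illustrative example, not a technique the paper itself proposes): for each pixel, the threshold T = m * (1 + k * (s / R - 1)) is computed from the mean m and standard deviation s of a local window, with typical constants k = 0.2 and R = 128. The function name and the naive unoptimized window scan are assumptions for illustration.

```python
def sauvola_threshold(img, w=15, k=0.2, R=128.0):
    # img: 2-D list of gray levels. For each pixel, compute the mean m and
    # standard deviation s of a w x w neighbourhood (clipped at the borders)
    # and threshold with T = m * (1 + k * (s / R - 1)).
    h, wd = len(img), len(img[0])
    out = [[0] * wd for _ in range(h)]
    r = w // 2
    for y in range(h):
        for x in range(wd):
            win = [img[j][i]
                   for j in range(max(0, y - r), min(h, y + r + 1))
                   for i in range(max(0, x - r), min(wd, x + r + 1))]
            m = sum(win) / len(win)
            s = (sum((v - m) ** 2 for v in win) / len(win)) ** 0.5
            t = m * (1 + k * (s / R - 1))
            out[y][x] = 0 if img[y][x] <= t else 255  # text -> black
    return out
```

Because the threshold follows the local mean and contrast, dark strokes are kept even where the background intensity drifts, which is why such local methods outperform global thresholding on degraded pages.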
Local thresholding-based techniques have been proposed to estimate a different threshold for each
2009 10th International Conference on Document Analysis and Recognition
978-0-7695-3725-2/09 $25.00 © 2009 IEEE
DOI 10.1109/ICDAR.2009.220