Enhanced Text Extraction from Arabic Degraded Document Images using EM Algorithm

Wafa Boussellaa 1, Aymen Bougacha 1, Abderrazak Zahour 2, Haikal El Abed 3, Adel Alimi 1
1 University of Sfax, REGIM, ENIS, Route Soukra, BPW, 3038, Sfax, Tunisia
2 IUT, Université du Havre, Place Robert Schuman, 76610 Le Havre, France
3 Technical University Braunschweig, Institute for Communication Technology (IfN), Germany
Email: {wafa.boussellaa,adel.alimi,elabed}@ieee.org, Aymen.Bougacha@gmail.com, abderrazak.zahour@univ-lehavre.fr

Abstract

This paper presents a new enhanced text extraction algorithm for degraded document images based on probabilistic models. The observed document image is modeled as a mixture of Gaussian densities representing the foreground and background components of the document image. The EM algorithm is introduced to estimate and refine the parameters of the mixture of densities iteratively. The initial parameters of the EM algorithm are estimated by the k-means clustering method. After parameter estimation, the document image is partitioned into text and background classes by means of the ML approach. The performance of the proposed approach is evaluated on a variety of degraded documents coming from the collections of the National Library of Tunisia.

1. Introduction

The automatic processing of degraded historical documents is a challenge in the field of document image analysis, which faces many difficulties due to storage conditions and the complexity of document content. For historical, degraded, and poor-quality documents, enhancement is not an easy task. The main purpose of an enhancement step for historical documents is to remove information coming from the background. Background artifacts can derive from many kinds of degradation, such as optical scan blur and noise, spots, underwriting, or overwriting. Most previous document image enhancement algorithms have been designed primarily for binarization.
Binarization aims to extract text from distorted, degraded documents; most related methods were proposed for processing grayscale documents and have not been extended to color documents. In this paper, an enhanced text extraction method is proposed on the basis of maximum likelihood (ML) estimation for the segmentation problem. The difficult task lies in how to estimate the parameters of the likelihood functions and the number of segments. To this end, the expectation-maximization (EM) algorithm is introduced to improve the parameter estimation. The initial estimates for the EM algorithm are provided by the k-means clustering algorithm, which avoids the problem of random initialization. Text/background segmentation is then performed by the conventional ML method. The segmented image is used to produce a final restored color document image.

The rest of this paper is organized as follows. Section 2 gives an overview of previous work on degraded document image enhancement. Sections 3 and 4 describe the proposed algorithm in detail. Experimental results are presented in Section 5. Conclusions and future work are given in the last section.

2. Related work

According to the literature, approaches that deal with document image enhancement and text extraction are based on binarization or foreground/background separation techniques. Most previous document image enhancement algorithms have been designed primarily for binarization. Binarization is performed either globally or locally. Global thresholding methods are not sufficient, since document images are usually degraded by shadows, non-uniform illumination, and low contrast. Local methods are shown to perform better according to the recent exhaustive survey of image binarization methods presented in [15]. Two main approaches can be distinguished: local thresholding-based methods and clustering-based methods.
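As a rough sketch of the pipeline described above — k-means initialization, EM refinement of a Gaussian mixture, and ML pixel classification — the following minimal 1-D example may help. It is not the authors' implementation: the two-component assumption, iteration counts, and variance floor are illustrative choices, and real document pixels would first be flattened to a 1-D gray-level array.

```python
import numpy as np

def kmeans_init(pixels, k=2, iters=10):
    """Simple 1-D k-means to seed the mixture parameters."""
    centers = np.linspace(pixels.min(), pixels.max(), k)
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean()
    return centers, labels

def em_gmm(pixels, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture by EM, seeded by k-means."""
    mu, labels = kmeans_init(pixels, k)
    # Per-cluster variance and mixing weight from the k-means partition
    var = np.array([pixels[labels == j].var() if np.any(labels == j)
                    else pixels.var() for j in range(k)]) + 1e-4
    pi = np.array([(labels == j).mean() for j in range(k)])
    for _ in range(iters):
        # E-step: responsibilities (posterior component probabilities)
        dens = pi * np.exp(-0.5 * (pixels[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0)
        pi = nk / len(pixels)
        mu = (resp * pixels[:, None]).sum(axis=0) / nk
        var = (resp * (pixels[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-4)  # variance floor to avoid collapse
    return pi, mu, var

def classify_ml(pixels, pi, mu, var):
    """Assign each pixel to the most probable component (text vs background)."""
    dens = pi * np.exp(-0.5 * (pixels[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    return np.argmax(dens, axis=1)
```

With two well-separated gray-level modes (dark ink, light paper), the EM step mainly refines the k-means seed; its value shows on degraded documents where the modes overlap and hard k-means boundaries misclassify pixels near the threshold.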
Local thresholding-based techniques have been proposed to estimate a different threshold for each

2009 10th International Conference on Document Analysis and Recognition, 978-0-7695-3725-2/09 $25.00 © 2009 IEEE, DOI 10.1109/ICDAR.2009.220, p. 743