A Novel Approach for Blind Source Separation of Mixed Document Images in Farsi Scanned Documents

Hossein Ghanbarloo, Farbod Razzazi
Department of Electrical Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
hosseinghanbarloo@gmail.com, razzazi@srbiau.ac.ir

Shahpour Alirezaee
Department of Electrical Engineering, Zanjan University, Zanjan, Iran
alirezaee@znu.ac.ir

Abstract: In the field of mixed scanned document separation, various studies have aimed to reduce one or more unwanted artifacts in a document. Most approaches are based on comparing the front and back sides of the document. Some studies have suggested analyzing color images instead; however, because of their computational complexity, these approaches are of limited use in practice. Furthermore, none of them has been tested on Farsi documents. In this paper, an approach applicable to large images is presented, based on image block segmentation (mosaicing). Its advantages are lower memory usage, the combination of simultaneous and ordinal blind source separation methods to increase their efficiency, a reduction of the computational complexity to twenty percent of that of the basic algorithm, and high stability on noisy images. In noiseless conditions, the average signal-to-noise ratio of the output images is 29.25 dB. All of these cases have been tested on official Farsi documents. Applying the suggested ideas yields considerable accuracy in minimal time. In addition, various parameters of the proposed algorithm (e.g., the block size, an appropriate initial point, and the number of iterations) were optimized.

Keywords: blind source separation; independent component analysis; show-through; background removal; feed-through; scanned documents

I.
INTRODUCTION

In many cases, scanned or photographed documents carry additional information besides the main document image. Some of this extra information may not be visually detectable. Depending on the user's goal, it may be of interest to highlight it or to remove it. During document imaging, the intense illumination may cause the image of the back side of the document to be mixed with the front side in the resulting image. If the ink of the back side affects the front side image of the document, the effect is called "bleed-through"; if the document is two-sided, or several thin documents are stacked during imaging, the effect is called "show-through". Sometimes it is necessary to separate the various layers of the scanned image in order to focus on each of them.

Background removal is a notable example in document analysis [1, 2]. Undesired effects on the background of a document consist of various elements such as optical blurring and noise caused by scanning, dots, underwriting, and overwriting [3, 4, 5]. These cases are particularly important in restoring images of ancient documents and retrieving data from them. The ideas suggested in this paper can be used in decoding security documents, in addition to OCR applications and the detection of handwriting and subscripts.

Various techniques are available to increase document quality. One of the earliest methods for document quality enhancement is binarization, which produces acceptable results when a single source is accompanied by noise [6]. Some researchers have tried to reduce the show-through effect by using a blind source separation (BSS) approach. However, this has been done by registering the images of both sides of the paper, and detecting this one-to-one correspondence is a hard and time-consuming task [7, 8].
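The two-sided BSS formulation mentioned above can be illustrated with a minimal sketch. It assumes a linear, instantaneous mixture of two already registered sides, which is a simplification of real show-through, and it uses a toy kurtosis-based rotation search in place of a full ICA algorithm; it is not the method proposed in this paper. The mixing matrix, the synthetic "ink" signals, and the function names `whiten` and `separate_two_sources` are hypothetical choices made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def whiten(x):
    """Center and whiten observations x of shape (n_signals, n_samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x))
    # Symmetric whitening matrix: inverse square root of the covariance.
    w = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return w @ x


def separate_two_sources(x, n_angles=180):
    """Toy 2x2 ICA: after whitening, the remaining ambiguity is a rotation.

    A grid search picks the rotation that maximizes the sum of squared
    excess kurtoses of the two outputs (a standard non-Gaussianity contrast).
    """
    z = whiten(x)
    best_y, best_score = z, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        y = np.array([[c, s], [-s, c]]) @ z
        # Outputs have (approximately) unit variance, so E[y^4] - 3
        # estimates the excess kurtosis.
        kurt = np.mean(y ** 4, axis=1) - 3.0
        score = np.sum(kurt ** 2)
        if score > best_score:
            best_y, best_score = y, score
    return best_y


# Synthetic stand-ins for the flattened front and back "ink" images
# (assumption: sparse, statistically independent pixel patterns).
rng = np.random.default_rng(0)
front = (rng.random(4096) < 0.10).astype(float)
back = (rng.random(4096) < 0.15).astype(float)
sources = np.vstack([front, back])

# Hypothetical mixing: each scan sees its own side plus attenuated show-through.
mixing = np.array([[1.0, 0.4],
                   [0.3, 1.0]])
observed = mixing @ sources

recovered = separate_two_sources(observed)

# Absolute correlation between each true source (rows) and each output (columns).
corr = np.abs(np.corrcoef(np.vstack([sources, recovered]))[:2, 2:])
best_match = corr.max(axis=1)
```

As with any ICA-style method, the recovered components are defined only up to order, sign, and scale, which is why the comparison above uses absolute correlations. The sketch also presupposes that the two scans are pixel-aligned, which is exactly the registration step described as costly in [7, 8].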
A competing approach employs a Markov model for bleed-through removal, which has improved the readability of the resulting document [9, 10, 11]. In addition, the Markov model is useful for enhancing the reconstructed image obtained by the BSS approach [12]. In [13], independent component analysis (ICA) is compared with diffusion methods, and the advantages of the diffusion method are illustrated. In [14], various ICA approaches are compared in order to evaluate the quality enhancement of the document; the comparison is based on acquiring a color scan of the document, decomposing it into its RGB components, and finally testing on ancient documents containing watermarked and hidden text. In that study, analyzing the RGB components removed the need to scan and register the other side of the document. In [15], this procedure is applied to documents containing show-through, ink bleed, and palimpsests. There, the sources are separated by decomposing the color image into its three RGB components and then using ICA to solve the resulting system of three equations (the color components) in three unknowns (background, body text, foreground).
978-1-4244-8230-6/10/$26.00 ©2010 IEEE
In addition,