A Novel Approach for Blind Source Separation of
Mixed Document Images in Farsi Scanned
Documents
Hossein Ghanbarloo, Farbod Razzazi
Department of Electrical Engineering
Islamic Azad University, Science and Research Branch
Tehran, Iran
hosseinghanbarloo@gmail.com, razzazi@srbiau.ac.ir
Shahpour Alirezaee
Department of Electrical Engineering
Zanjan University
Zanjan, Iran
alirezaee@znu.ac.ir
Abstract— In the field of mixed scanned document separation,
various studies have been carried out to reduce one or more
unwanted artifacts in a document. Most of these approaches are
based on comparing the front and back sides of the document.
In some cases, analysis of the color image has been suggested;
however, because of their computational complexity, these
approaches are of limited use in practical applications. Furthermore,
none of them has been tested on Farsi documents. In this paper, an
approach applicable to large images is presented, which is based
on image block segmentation (mosaicing). The advantages of this
approach are lower memory usage, the combination of simultaneous and
ordinal blind source separation methods to increase their
efficiency, a reduction of the computational complexity to
twenty percent of that of the basic algorithm, and high stability on
noisy images. In noiseless conditions, the average signal-to-noise
ratio of the output images is 29.25 dB. Furthermore, all of these
cases have been tested on official Farsi documents. By applying the
suggested ideas, considerable accuracy is achieved in the results in
minimal time. In addition, various parameters of the proposed
algorithm (e.g., the block size, an appropriate initial point, and the
number of iterations) were optimized.
Keywords-Blind Source Separation; Independent Component
Analysis; show-through; background removal; feed-through;
scanned documents.
I. INTRODUCTION
In many cases, scanned or photographed documents contain
extra information in addition to the main document image.
Some of this extra information may not be visually detectable.
Depending on the user's goal, it may be of interest to
highlight or remove it.
In the document imaging procedure, due to the intense
illumination, the image of the back side of the document
may be mixed with the front side in the resulting image. If the
back-side ink affects the front-side image of the document,
the effect is named "bleed-through"; if the document is
two-sided, or there are consecutive thin documents in the
imaging procedure, the effect is called "show-through".
Sometimes, it may be necessary to separate the various layers
of the scanned image in order to focus on each of them.
Background removal is a notable example in document
analysis [1, 2]. Undesired effects on the background of a
document consist of various elements such as optical blurring
and noise caused by scanning, dots, underwriting, and
overwriting [3, 4, 5]. These cases are particularly important in
restoring the image of, and retrieving data from, ancient
documents. The ideas suggested in this paper can be used in
decoding security documents, in addition to OCR applications
and handwriting and subscript detection. Various techniques
are available to increase document quality.
One of the earliest methods for document quality
enhancement is the binarization approach, which produces
acceptable results when a single source is accompanied by
noise [6]. Some researchers have tried to reduce the
show-through effect by using the blind source separation (BSS)
approach. However, this has been done by registering the
images of both sides of the paper, and detecting this
one-to-one correspondence is a hard and time-consuming task [7,
8]. A competing approach employs a Markov model for
bleed-through removal, which improves the readability of
the resulting document [9, 10, 11]. In addition, the Markov
model is useful in enhancing the reconstructed image obtained
by the BSS approach [12].
In [13], independent component analysis (ICA) and diffusion
methods are compared, and the advantages of the diffusion
method are illustrated. In addition, in [14], various ICA
approaches have been compared in order to evaluate the quality
enhancement of the document: a color scan of the document is
acquired and decomposed into its RGB components, and the
approaches are tested on ancient documents containing
watermarks and hidden text. In that study, analyzing the RGB
components removes the need to scan and register the other
side of the document.
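
The idea of treating the color channels of a single scan as mixed observations can be illustrated with a small sketch. It is not the algorithm of [14] or of this paper: it assumes a linear, instantaneous mixing model, uses synthetic pixel layers instead of a real scan, and uses a plain symmetric FastICA with a tanh nonlinearity as a stand-in for whichever ICA variant is actually applied. The names `sym_decorrelate`, `whiten`, `fast_ica`, and `demo`, the layer statistics, and the mixing matrix are all hypothetical choices made for illustration.

```python
import numpy as np


def sym_decorrelate(w):
    """Symmetric decorrelation: W <- (W W^T)^(-1/2) W, giving orthonormal rows."""
    eigvals, eigvecs = np.linalg.eigh(w @ w.T)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ w


def whiten(x):
    """Center each channel, then decorrelate the channels to unit variance."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ x


def fast_ica(x, n_iter=500, tol=1e-8, seed=0):
    """Symmetric FastICA (tanh nonlinearity) on a (channels x pixels) array."""
    z = whiten(x)
    n, m = z.shape
    rng = np.random.default_rng(seed)
    w = sym_decorrelate(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        g = np.tanh(w @ z)
        g_prime_mean = (1.0 - g ** 2).mean(axis=1)
        # Fixed-point update: E[g(w z) z^T] - E[g'(w z)] w, then re-orthonormalize.
        w_new = sym_decorrelate(g @ z.T / m - g_prime_mean[:, None] * w)
        change = np.max(np.abs(np.abs(np.sum(w_new * w, axis=1)) - 1.0))
        w = w_new
        if change < tol:
            break
    return w @ z


def demo(n_pixels=20000, seed=0):
    """Mix three synthetic layers into R, G, B and try to recover them blindly."""
    rng = np.random.default_rng(seed)
    # Assumed layers, flattened to pixel vectors (purely synthetic).
    background = rng.uniform(0.0, 1.0, n_pixels)
    body_text = (rng.random(n_pixels) < 0.15).astype(float)
    show_through = (rng.random(n_pixels) < 0.10).astype(float)
    sources = np.vstack([background, body_text, show_through])
    # Assumed mixing matrix: how strongly each layer appears in each channel.
    mixing = np.array([[0.7, 0.6, 0.3],
                       [0.5, 0.8, 0.2],
                       [0.3, 0.5, 0.6]])
    rgb = mixing @ sources          # observed channels, shape (3, n_pixels)
    estimated = fast_ica(rgb)       # recovered up to order, sign, and scale
    # Absolute correlation between each true layer and each estimate.
    return np.abs(np.corrcoef(np.vstack([sources, estimated]))[:3, 3:])
```

Calling `demo()` returns a 3-by-3 matrix of absolute correlations between the true and estimated layers. Because ICA recovers sources only up to permutation, sign, and scale, each true layer is matched to the estimate with the largest entry in its row. On a real scan, the three rows of the observation matrix would be the flattened R, G, and B channels.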
In [15], this procedure has been applied to documents
containing show-through, ink bleed, and palimpsests. There,
the sources are separated by decomposing the color image into
its three RGB components and then using the ICA approach to
solve a system of three equations (the color components) in
three unknowns (background, body text, foreground). In addition,
978-1-4244-8230-6/10/$26.00 ©2010 IEEE 133