A Novel Lexicon Reduction Method For Arabic Handwriting Recognition
Safwan Wshah, Venu Govindaraju
Department of Computer Science and Engineering
University at Buffalo
Amherst, NY, USA
{srwshah, govind}@buffalo.edu
Yanfen Cheng¹ and Huiping Li²
¹ Wuhan University of Technology, School of Computer Science, China
² Applied Media Analysis, Inc, USA
{chengyanfen,huipingli}@gmail.com
Abstract—In this paper, we present a method for lexicon size
reduction that can serve as a pre-processing step for
off-line Arabic word recognition. The method extracts
dot descriptors and PAWs (pieces of Arabic words). The number
and position of dots and the number of PAWs are then used to
eliminate unlikely candidates. Dot descriptors are extracted
with a set of defined rules, followed by a convolutional
neural network for verification. The reduction algorithm
combines the two features with a dynamic matching scheme. On
the IFN/ENIT database of 26459 Arabic handwritten word images,
we achieved a reduction rate of 87% with accuracy above 93%.
Keywords- Lexicon reduction; Arabic offline handwriting;
handwriting recognition.
I. INTRODUCTION
The Arabic letters were created around the 7th century by
adding dots to existing letters. As a result, several letters
share exactly the same base form and differ only by single,
double or triple dots [1]. Other small marks (diacritics)
indicate short vowels, but are often omitted. Arabic script,
both handwritten and printed, is cursive: the letters are
joined together along a writing line [2]. The Arabic alphabet
includes 28 letters (Figure 1), each with two or four shapes
depending on its position in a sub-word: start, middle, end,
or isolated. Arabic text is written from right to left, and
adjacent letters are joined together, except that a sub-word
ends at any of the six red letters in Figure 1. Fifteen of the
28 Arabic characters are dotted: ten have one dot, three have
two dots and two have three dots. Each of these characters has
a single main stroke and is distinguished from characters with
the same base form only by the presence or absence, position
or number of dots.
Figure 1. (a) The Arabic alphabet.
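The letter inventory described above can be turned into simple per-word features on the lexicon side. The following Python sketch is illustrative only: the names `DOTS`, `NON_JOINING`, `dot_count`, `paw_count` and `describe` are hypothetical, the letter groupings follow the standard 28-letter alphabet, and variants such as ة, ى and the hamza forms are ignored. It covers only dot and PAW counts for lexicon text, not dot positions and not the extraction of these features from word images.

```python
# Illustrative sketch (not the authors' implementation): ideal dot and
# PAW counts for a lexicon word, using the basic 28-letter alphabet.

ONE_DOT = "ب ج خ ذ ز ض ظ غ ف ن".split()     # ten letters with one dot
TWO_DOTS = "ت ق ي".split()                   # three letters with two dots
THREE_DOTS = "ث ش".split()                   # two letters with three dots

# The six letters that never join to the following letter,
# so each of them closes the current sub-word (PAW).
NON_JOINING = set("ا د ذ ر ز و".split())

DOTS = {}
DOTS.update({c: 1 for c in ONE_DOT})
DOTS.update({c: 2 for c in TWO_DOTS})
DOTS.update({c: 3 for c in THREE_DOTS})


def dot_count(word: str) -> int:
    """Total number of dots in the word (undotted letters count as 0)."""
    return sum(DOTS.get(c, 0) for c in word)


def paw_count(word: str) -> int:
    """Number of sub-words: a new PAW starts after every non-joining letter."""
    count = 0
    inside_paw = False
    for c in word:
        if not inside_paw:
            count += 1          # this letter opens a new PAW
            inside_paw = True
        if c in NON_JOINING:
            inside_paw = False  # the next letter cannot attach to this one
    return count


def describe(word: str) -> tuple:
    """Return the (PAW count, dot count) pair for a lexicon entry."""
    return paw_count(word), dot_count(word)


if __name__ == "__main__":
    for w in ["كتاب", "دار", "تونس"]:
        print(w, describe(w))
```

Handwritten words often deviate from these ideal counts, so a practical reduction scheme would compare such lexicon-side values with image-side estimates under some tolerance rather than by exact equality.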
Research in off-line handwritten word recognition has
traditionally concentrated on relatively small lexicons of
ten to a thousand words. Arabic lexicons are large,
containing 30,000 to 90,000 words [1].
Recognition with a large lexicon can be made more efficient
by initially eliminating lexicon entries that are unlikely to
match the given image. This process is called lexicon
reduction or lexicon pruning, and can serve as a
pre-processing step that improves recognition
accuracy and speed by reducing classifier confusion [3].
A. Measuring lexicon reduction performance
Given a set of n word images and a corresponding lexicon,
we denote the lexicon corresponding to image x_i by L_i and
the true word (ground truth) of x_i by t_i. A lexicon
reduction algorithm takes x_i and L_i as input and
determines a reduced lexicon Q_i ⊆ L_i. We denote the event
that t_i is contained in the reduced lexicon by a random
variable A, where A = 1 if t_i ∈ Q_i and A = 0 otherwise. The
extent of reduction is captured by the random variable R,
defined as R = (|L_i| − |Q_i|)/|L_i| [4].
Three measures of lexicon reduction performance are
defined as follows:
• Accuracy of reduction: α = E(A).
• Degree of reduction: ρ = E(R).
• Reduction efficacy: η = α^k · ρ.
Note that α, ρ, η ∈ [0,1]. The accuracy and degree of
reduction are usually inversely related: α can
often be made arbitrarily close to unity at the expense of ρ.
The two measures are combined into one overall measure η.
The emphasis placed on accuracy relative to the degree of
reduction is expressed as a constant k, which in turn may be
determined by a particular application [4][7].
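As a minimal sketch of how these measures might be estimated in practice, the Python function below replaces the expectations E(A) and E(R) with sample means over the n test images and then combines them into the efficacy. The function name `reduction_metrics`, its argument names and the toy data are illustrative assumptions, not part of the original method.

```python
from typing import Sequence, Set, Tuple


def reduction_metrics(
    truths: Sequence[str],
    lexicons: Sequence[Set[str]],
    reduced: Sequence[Set[str]],
    k: float = 1.0,
) -> Tuple[float, float, float]:
    """Estimate (alpha, rho, eta) as sample means over n word images.

    truths[i]   : true word of image i
    lexicons[i] : full lexicon for image i
    reduced[i]  : reduced lexicon for image i (a subset of lexicons[i])
    k           : emphasis on accuracy relative to degree of reduction
    """
    n = len(truths)
    if n == 0 or not (n == len(lexicons) == len(reduced)):
        raise ValueError("inputs must be non-empty and of equal length")

    # A = 1 when the true word survives the reduction, 0 otherwise.
    alpha = sum(1 for t, q in zip(truths, reduced) if t in q) / n

    # R = (|L| - |Q|) / |L| for each image.
    rho = sum((len(l) - len(q)) / len(l) for l, q in zip(lexicons, reduced)) / n

    # Reduction efficacy combines both measures.
    eta = (alpha ** k) * rho
    return alpha, rho, eta


if __name__ == "__main__":
    lexicon = {"a", "b", "c", "d"}
    truths = ["a", "b"]
    reduced = [{"a"}, {"c", "d"}]   # second reduction drops the true word
    print(reduction_metrics(truths, [lexicon, lexicon], reduced, k=1.0))
    print(reduction_metrics(truths, [lexicon, lexicon], reduced, k=2.0))
```

Raising k penalizes lost ground-truth entries more heavily, which suits applications where a wrongly pruned word cannot be recovered by the later recognition stage.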
B. Background: Dot and PAW features
Some Arabic characters have exactly the same base forms
and are distinguished only by presence, position and the
number of dots. Double dots are often written as one
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.702
2857