A Novel Lexicon Reduction Method For Arabic Handwriting Recognition

Safwan Wshah, Venu Govindaraju
Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA
{srwshah, govind}@buffalo.edu

Yanfen Cheng 1 and Huiping Li 2
1 Wuhan University of Technology, School of Computer Science, China
2 Applied Media Analysis, Inc, USA
{chengyanfen,huipingli}@gmail.com

Abstract—In this paper, we present a method for lexicon size reduction that can be used as an important pre-processing step for off-line Arabic word recognition. The method involves extraction of the dot descriptors and PAWs (Pieces of Arabic Words). The number and position of dots and the number of PAWs are then used to eliminate unlikely candidates. The extraction of the dot descriptors is based on defined rules, followed by a convolutional neural network for verification. The reduction algorithm combines the two features with a dynamic matching scheme. On the IFN/ENIT database of 26,459 Arabic handwritten word images, we achieved a reduction rate of 87% with accuracy above 93%.

Keywords- Lexicon reduction; Arabic offline handwriting; handwriting recognition.

I. INTRODUCTION

The Arabic letters were created around the 7th century by adding dots to existing letters. Therefore, several letters have exactly the same base form and differ only by single, double, or triple dots [1]. Other small marks (diacritics) indicate short vowels, but they are often omitted. In addition, the Arabic script, both handwritten and printed, is cursive, and the letters are joined together along a writing line [2]. The Arabic alphabet includes 28 letters (Figure 1), each with two or four shapes depending on its position in a sub-word: at the start, in the middle, at the end, or isolated.
Arabic text is written from right to left, and adjacent letters are joined together, except after one of the six letters shown in red in Figure 1, each of which ends a sub-word. 15 of the 28 Arabic characters are dotted: ten have one dot, three have two dots, and two have three dots. These characters contain a unique main stroke and are distinguished only by the presence or absence, position, or number of dots.

Figure 1. (a) The Arabic alphabet.

Research in off-line handwritten word recognition has traditionally concentrated on relatively small lexicons of ten to a thousand words. The Arabic language has large lexicons containing 30,000 to 90,000 words [1]. Recognition with a large lexicon can be made more efficient by initially eliminating lexicon entries that are unlikely to match the given image. This process is called lexicon reduction or lexicon pruning, and it can be used as an important pre-processing step that improves recognition accuracy and speed by reducing classifier confusion [3].

A. Measuring lexicon reduction performance

Given a set of $n$ word images, each with a corresponding lexicon, we denote the lexicon corresponding to image $x_i$ by $L_i$. A lexicon reduction algorithm takes $x_i$ and $L_i$ as input and determines a reduced lexicon $Q_i \subseteq L_i$. We denote the event that $t_i$, the correct word for $x_i$, is contained in the reduced lexicon by a random variable $A$, where $A = 1$ if $t_i \in Q_i$ and $A = 0$ otherwise. The extent of reduction is captured by the random variable $R$, defined as $R = (|L_i| - |Q_i|)/|L_i|$ [4]. Three measures of lexicon reduction performance are defined as follows:

Accuracy of reduction: $\alpha = E(A)$.
Degree of reduction: $\rho = E(R)$.
Reduction efficacy: $\eta = \alpha^k \cdot \rho$.

Note that $\alpha, \rho, \eta \in [0, 1]$. The accuracy and degree of reduction are usually inversely related: $\alpha$ can often be made arbitrarily close to unity at the expense of $\rho$. The two measures are therefore combined into one overall measure, $\eta$.
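The three measures can be estimated from a test set by replacing each expectation with a sample mean over the images. The following is a minimal sketch under that assumption; the function name `reduction_measures` and its argument names are hypothetical and do not come from the paper, and `k` is the exponent applied to the accuracy in the efficacy measure.

```python
from typing import Sequence, Set, Tuple


def reduction_measures(
    truths: Sequence[str],
    lexicons: Sequence[Set[str]],
    reduced: Sequence[Set[str]],
    k: float = 1.0,
) -> Tuple[float, float, float]:
    """Estimate (alpha, rho, eta) with sample means (an assumption of this sketch).

    truths[i]   -- correct word t_i for image x_i
    lexicons[i] -- full lexicon L_i
    reduced[i]  -- reduced lexicon Q_i, expected to be a subset of L_i
    """
    n = len(truths)
    if n == 0 or len(lexicons) != n or len(reduced) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    # A_i = 1 if the correct word survives the reduction, else 0.
    alpha = sum(1 for t, q in zip(truths, reduced) if t in q) / n

    # R_i = (|L_i| - |Q_i|) / |L_i|, the fraction of entries removed.
    rho = sum((len(l) - len(q)) / len(l) for l, q in zip(lexicons, reduced)) / n

    # Efficacy: accuracy raised to the power k, times the degree of reduction.
    eta = (alpha ** k) * rho
    return alpha, rho, eta


if __name__ == "__main__":
    # Toy data: two images sharing a four-word lexicon.
    lex = {"a", "b", "c", "d"}
    alpha, rho, eta = reduction_measures(
        truths=["a", "b"],
        lexicons=[lex, lex],
        reduced=[{"a"}, {"c", "d"}],  # the second reduction loses the correct word
        k=1.0,
    )
    print(alpha, rho, eta)  # 0.5 0.625 0.3125
```

In this toy case, raising `k` from 1 to 2 lowers the efficacy from 0.3125 to 0.15625, because the missed word is penalized more heavily while the degree of reduction is unchanged.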
The emphasis placed on accuracy relative to the degree of reduction is expressed by the constant $k$, which may in turn be determined by the particular application [4][7].

2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.702 2857

B. Background: Dot and PAW features

Some Arabic characters have exactly the same base forms and are distinguished only by the presence, position, and number of dots. Double dots are often written as one