Off-line Farsi / Arabic Handwritten Word Recognition Using Vector Quantization and Hidden Markov Model

Behroz Vaseghi
Engineering Department, Islamic Azad University, Najafabad Branch, Iran
vaseghi@znu.ac.ir

Shahpour Alirezaee
Engineering Department, University of Zanjan, Zanjan, Iran
Alirezaee@znu.ac.ir

Majid Ahmadi
ECE Department, University of Windsor, Windsor, Ontario, Canada
Ahmadi@uwindsor.ca

Rasoul Amirfattahi
ECE Department, Isfahan University of Technology, Isfahan, Iran
fattahi@iut.ac.ir

Abstract-This paper presents a Farsi handwritten word recognition system for reading city names in postal addresses. The method is based on vector quantization (VQ) and hidden Markov models (HMMs). A window sliding from right to left is used to extract four proposed features. After feature extraction, K-means clustering is used to generate a codebook, and VQ generates a codeword for each word image. In the next stage, an HMM is trained with the Baum-Welch algorithm for each city name. A test image is recognized by finding the best match (likelihood) between the image and all of the HMM word models using the forward algorithm. Experimental results show the advantages of using the VQ/HMM recognizer engine instead of a conventional discrete HMM.

Keywords- Off-line character recognition, HMM, VQ.

I. INTRODUCTION

During the past decade, the pattern recognition community has achieved remarkable progress in the field of handwritten word recognition. Many papers dealing with applications of handwritten word recognition to the automatic reading of postal addresses, bank checks, and forms (invoices, coupons, revenue documents, etc.) have been published [1]. However, most of this work dealt with the recognition of Latin and Chinese scripts. Progress in Arabic script recognition has been slow, mainly due to the special characteristics of Arabic script.
Arabic text is inherently cursive in both handwritten and printed forms and is written horizontally from right to left. The reader is referred to [2-3] for further details on Arabic script characteristics and on the state of the art in Arabic character recognition. Farsi writing, which this paper addresses, is very similar to Arabic in terms of strokes and structure. The only difference is that Farsi has four more characters than Arabic in its character set. Therefore, a Farsi word recognizer can also be used for Arabic word recognition. This paper presents a Farsi handwritten word recognition system based on a discrete hidden Markov model and vector quantization.

II. THE WORD RECOGNITION SYSTEM

The proposed system is designed for reading city names from the address field. This application falls within the limited lexicon category. The lexicon size is limited (198 city names) or can be pruned by using additional information such as ZIP codes. The block diagram of the system is illustrated in Fig. 1. An image of a postal envelope is captured using a scanner with 300-dpi resolution and 256 gray levels. Then the city name is extracted from the image and assigned an appropriate label between 1 and 198. Our database consists of 6000 word images of the cities in Iran.

978-1-4244-2824-3/08/$25.00 ©2008 IEEE

A. Preprocessing

The preprocessing consists of the following steps:
• Binarization: The gray-level image of a word is binarized at a threshold determined by a modified version of the maximum entropy sum and entropic correlation methods [4].
• Noise removal: The binarized image often has spurious segments, which are removed by a morphological closing operation followed by a morphological opening operation, both with a 3x3 window as the structuring element.
• Bounding-box cropping: To reduce memory use and increase processing speed, the binarized image is cropped to its bounding rectangle (Fig. 2b).

B. Frame generation

In this phase, the word image is converted to a sequential form suitable for the HMM recognition engine. The area of the image is divided into a set of vertical fixed-width frames (strips) from right to left. The width of a frame is set to approximately twice the average stroke width of the word image, and there is a 50% overlap between two consecutive frames, so consecutive frames are offset by roughly one stroke width.

C. Feature extraction

In this stage, four features are extracted from each frame of the image: