Council for Innovative Research
International Journal of Computers and Technology
www.ijctonline.com ISSN: 2277-3061
Volume 3, No. 1, AUG, 2012
www.cirworld.com

PRINTED ARABIC CHARACTERS CLASSIFICATION USING A STATISTICAL APPROACH

Ihab Zaqout
Dept. of Information Technology, Faculty of Engineering & Information Technology, Al-Azhar University – Gaza, Palestine

ABSTRACT
In this paper, we propose simple classifiers for printed Arabic characters based on statistical analysis. For each of the transparent, simplified and traditional Arabic fonts, 109 printed Arabic character images are created. The images are preprocessed by binarization followed by a sequence of morphological operations. A non-linear filter is applied to the thinned ridge map to extract termination and bifurcation features. The thinned ridge map vectors (TRMVs) are created using a Freeman chain code template. The spatial distribution and statistical properties of the extracted features are then calculated.

Keywords
Freeman chain coding; character recognition; feature extraction; classification.

1. INTRODUCTION
This paper introduces 109 classifiers, including thinned ridge map vectors (TRMVs), for each of the transparent, simplified and traditional Arabic fonts. The use of TRMVs for Arabic text classification is left as future work.

Arabic ranks fifth among the most commonly used languages worldwide, and Arabic speakers make up around 7% of the world's population. The estimated number of Arabic speakers worldwide is about 437 million, including 85 million active Internet users. Published research on recognizing Arabic letters, whether printed or handwritten, is scarce compared with the published research on English character recognition. Arabic character recognition remains one of the most challenging and exciting research areas in Optical Character Recognition (OCR).
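As a rough illustration of the feature extraction step outlined in the abstract, the sketch below marks termination and bifurcation points on an already thinned binary image and encodes a pixel path as a Freeman chain code. The crossing-number rule, the direction numbering (0 = east, counterclockwise) and all function names are assumptions chosen for illustration; the text above does not specify the non-linear filter or the chain code template actually used.

```python
# Sketch only: the crossing-number rule and the direction numbering are common
# conventions assumed here for illustration, not taken from the paper.

# The 8 neighbours in circular order as (row offset, col offset), starting
# east and going counterclockwise (image rows grow downward).
NEIGHBOURS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]


def pixel(img, r, c):
    """Return img[r][c], treating out-of-bounds positions as background (0)."""
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0


def crossing_number(img, r, c):
    """Half the number of 0/1 transitions around the 8-neighbourhood of (r, c)."""
    ring = [pixel(img, r + dr, c + dc) for dr, dc in NEIGHBOURS]
    return sum(abs(ring[i] - ring[(i + 1) % 8]) for i in range(8)) // 2


def classify_skeleton(img):
    """Return (terminations, bifurcations) as lists of (row, col) points."""
    terminations, bifurcations = [], []
    for r in range(len(img)):
        for c in range(len(img[0])):
            if img[r][c] != 1:
                continue
            cn = crossing_number(img, r, c)
            if cn == 1:            # exactly one branch leaves the pixel
                terminations.append((r, c))
            elif cn >= 3:          # three or more branches meet here
                bifurcations.append((r, c))
    return terminations, bifurcations


def freeman_chain(path):
    """Encode consecutive 8-connected (row, col) points as Freeman codes 0-7."""
    return [NEIGHBOURS.index((r1 - r0, c1 - c0))
            for (r0, c0), (r1, c1) in zip(path, path[1:])]


# A toy T-shaped skeleton: three stroke ends and one junction.
t_shape = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
ends, junctions = classify_skeleton(t_shape)
# ends == [(0, 2), (2, 0), (2, 4)], junctions == [(2, 2)]
code = freeman_chain([(0, 2), (1, 2), (2, 2), (2, 3), (2, 4)])
# code == [6, 6, 0, 0]  (two steps south, then two steps east)
```

The sketch assumes the input is a one-pixel-wide skeleton; on thicker strokes the crossing number would flag spurious points.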
Despite growing research interest in the recognition of Arabic text, which dates back to the early 1980s [1], there is still no comprehensive algorithm, owing to the complexity of the writing rules of Arabic characters. Zidouri [2] proposed sub-word segmentation and recognition: a three-layered radial basis function network is used for training and an 8-neighbor connected component algorithm is applied for segmentation. For recognition, PCA is applied to 200 binary images of size 32x32. Al-Jarrah et al. [3] proposed a main-line algorithm for segmentation that tokenizes the text and generates a set of 33 different tokens representing the 28 Arabic characters and their different shapes and variations. A feed-forward neural network is used to recognize the segmented characters. Almohri et al. [4] proposed a recognition algorithm based on feature extraction and a Fuzzy ART neural network. Sarhan and Helalat [5] proposed statistical analysis for feature extraction and an artificial neural network (ANN) for recognition. The ANN is trained using the Least Mean Squares (LMS) algorithm. Each typed Arabic letter is represented by a matrix of binary numbers that is used as input to a simple feature extraction system whose output, together with the input matrix, is fed to the ANN. Zheng [6] proposed features extracted from the four edges, with a back-propagation neural network (BPNN) implemented for recognition. Batawi and Abulnaja [7] proposed an optical character recognition voting (AOCRV) scheme based on the N-version programming (NVP) technique, which is applied to 35 printed text samples. Sofien et al. [8] proposed applying a generalized Hough transform to recognize printed Arabic characters in their different shapes; it is tested on a set of 234,868 samples of Arabic characters in the Arabic Transparent, Andalus and Traditional fonts. Hassin et al. [9] proposed a Hidden Markov model to recognize printed Arabic characters.
Each character/word is transformed in its entirety into a feature vector, and vector quantization is used to transform the word skeleton into a sequence of symbols.

Arabic text is distinguished from other languages by the following characteristics:

1. The Arabic alphabet consists of 28 characters, as shown in Fig. 1. The number of shapes increases according to the position of the letter in the word, bringing the total to 109, as shown in Table 1. For example, the letter sheen is written in four forms according to its position: at the beginning of a word, in the middle of a word, at the end of a word, or isolated.
2. Arabic text, whether printed or handwritten, is cursive and written from right to left, with letters connected to each other on the baseline.
3. Arabic characters differ in height relative to the baseline: some rise above the baseline and some descend below it, for example waw, ra and zen. The size of a character depends on its position in the word.
4. Arabic characters can be distinguished from each other by their number of components. Some consist of one part, such as ra, meem and waw; two parts, such as ba, kaf and noon; three parts, such as qaf, taa and yaa; or four parts, such as thaa and sheen. In addition, there are some ligatures, such as lam-alef.
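The component count described in the fourth characteristic can be illustrated with a short sketch. It counts the 8-connected foreground regions of a binarized character image using a breadth-first flood fill. The function name, the list-of-rows image representation and the toy image are assumptions made for illustration, not part of the method described in this paper.

```python
from collections import deque


def count_components(img):
    """Count 8-connected foreground (1) regions in a binary image (list of rows)."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited foreground pixel: it starts a new component.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                # Visit all 8 neighbours (the centre itself is already seen).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Toy image loosely resembling a two-part character: a body with one dot below.
two_part = [
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
parts = count_components(two_part)  # parts == 2 (body + dot)
```

With 8-connectivity, diagonally touching pixels belong to the same component, which matches the 8-neighbor convention mentioned in the related work; a 4-connected variant would split such pixels into separate parts.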