Life Science Journal 2014;11(10) http://www.lifesciencesite.com

Arabic OCR Segmented-based System

Hassanin M. Al-Barhamtoshy 1 and Mohsen A. Rashwan 2

1 Computing and Information Technology, King Abdulaziz University (KAU), Saudi Arabia
2 Electronics and Communication Department, Cairo University
hassanin@kau.edu.sa, mrashwan@rdi-eg.com

Abstract: This paper presents a new investigation of an Arabic OCR system for the offline recognition of machine-printed cursive words. A reliable transformation mechanism is used to convert image text into free text (ASCII or Unicode) that can be searched directly by a computer. A traditional preprocessing model (the segmentation phase) extracts each word from the image text and divides it into segments. The recognition phase then estimates the likelihood of each possible text/character class given the segments. Many classifiers can be used for this purpose, such as neural networks, Naïve Bayes, and HMM classifiers. These likelihoods are fed as input to a special algorithm that recognizes the entire word. The proposed framework comprises three main stages: preparation, training, and testing. Data preparation covers scanning, image selection, alignment, identification of text regions, and separation of non-text or image regions. The training stage then extracts features and builds the related language model; these features are used in the third stage. For the first stage, the paper focuses on the techniques used for font sizing, binarization, deskewing, clearing (denoising), and segmentation before recognition takes place.
[Hassanin M. Al-Barhamtoshy and Mohsen A. Rashwan. Arabic OCR Segmented-based System. Life Sci J 2014;11(10):1273-1283]. (ISSN:1097-8135). http://www.lifesciencesite.com. 200

Keywords: Arabic; OCR; Segmented-based; System

1.
Introduction

A method for Arabic feature selection was introduced for handwritten OCR [1], based on well-known common features extracted from the training patterns. The algorithm was implemented in an OCR system for recognizing one of the largest standard handwritten Farsi/Arabic digit datasets. A tool based on a proposed algorithm has been applied to find Table of Contents (ToC) pages in Urdu books without the use of OCR [2]. That algorithm employs machine learning to segment the document image into digits and non-digits, and vertical projection analysis is used to detect the column structure of a typical page. A recurrent neural network (RNN) has been used to recognize patterns in cursive handwritten documents [3]; the proposed solution had an error rate of 13.6% in the case of shape variations and 5.15% at the character level. Another technique segments printed Arabic text in order to split it into characters and then extracts features for each character to be recognized [4]. A further approach attempts to identify and separate handwritten from printed text using the Bag of Visual Words (BoVW) model: blocks of interest are first detected in the document image, and a descriptor is then calculated for each based on the BoVW [5]. Each block is finally classified as Handwritten, Machine Printed, or Noise.

2. Characteristics of Arabic Characters

Arabic is one of the most widely spoken languages in the world: about 422 million people speak it [6]. As the number of Arabic speakers grows, so does the number of Arabic documents and articles. Arabic is also the language of the Qur'an, so Muslims of all nationalities, such as Indonesians, are familiar with it.
This shows the importance of the Arabic language in the world [6], [7]. Its main distinguishing characteristics can be summarized as follows:
- The Arabic alphabet consists of 28 consonants and 8 vowels/diphthongs. Short vowels carry little weight in Arabic script and normally do not appear in writing.
- Arabic texts are read and written from right to left.
- Arabic texts are written in a cursive script, in which most characters are connected and their shapes vary according to their position in the word.
- Dots are present in 15 of the 28 Arabic letters, which makes some characters more prone to OCR errors than others.
- The morphological and syntactic complexity of Arabic grammar, which results in 60 billion possible surface forms, complicates error correction within dictionary-based solutions.
- The script is semi-cursive, whether printed or handwritten.
- Each character has a connection point on the right and/or left, linked on the baseline.
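
The position-dependent letter shapes described above can be inspected directly in Unicode, which encodes isolated, initial, medial, and final glyph variants in the Arabic Presentation Forms-B block. The following is a minimal sketch, not part of the system described in this paper; it assumes only Python's standard `unicodedata` module, and the function name `contextual_forms` is hypothetical.

```python
import unicodedata
from collections import defaultdict


def contextual_forms():
    """Map each base Arabic letter to the positional forms Unicode encodes for it."""
    forms = defaultdict(set)
    # Arabic Presentation Forms-B occupies U+FE70..U+FEFF.
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter forms such as "<initial> 0628";
        # skip ligatures, diacritic forms, and unassigned code points.
        if len(parts) != 2 or not parts[0].startswith("<"):
            continue
        tag = parts[0].strip("<>")       # isolated / initial / medial / final
        base = chr(int(parts[1], 16))    # the underlying abstract letter
        forms[base].add(tag)
    return forms


if __name__ == "__main__":
    forms = contextual_forms()
    # BEH connects on both sides; ALEF connects only to the preceding letter.
    for letter in ("\u0628", "\u0627"):
        print(unicodedata.name(letter), sorted(forms[letter]))
```

A letter such as beh, which joins on both sides, should report all four forms, while alef, which joins only to the preceding letter, should report only the isolated and final forms. A segmentation-based recognizer has to treat these variants as the same character class.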