International Journal of Computer Applications (0975 – 8887), Volume 103 – No. 3, October 2014

The Thinning Problem in Arabic Text Recognition - A Comprehensive Review

Atallah M. AL-Shatnawi, Department of Computer Information Systems, Al-albayt University
Khairuddin Omar, Center for Artificial Intelligence Technology (CAIT), University Kebangsaan Malaysia

ABSTRACT
This paper presents an overview of the thinning problem in Arabic text recognition. Thinning, or skeletonization, is a crucial stage in Arabic character recognition (ACR): it simplifies the text shape and reduces the amount of data that needs to be handled, and it is usually used as a pre-processing stage for recognition and storage systems. The skeleton of Arabic text can be used for baseline detection, character segmentation, and feature extraction, and it ultimately supports classification. Choosing or designing an effective thinning algorithm for Arabic text is therefore crucial in ACR. This paper discusses the importance of thinning for ACR and the uses of the text skeleton in an ACR system, as well as the challenges that affect the thinning of Arabic text. The methods of Arabic text thinning are reviewed according to the technique used, and their advantages and drawbacks are discussed in detail.

Keywords
Thinning, Skeleton, Iterative, Non-iterative, Parallel, Sequential, Pre-processing, Arabic character recognition.

1. INTRODUCTION
In the area of pattern recognition, language recognition is regarded as one of the most sophisticated problems in the field of Artificial Intelligence [61]. For the Arabic language, the ultimate goal of any Arabic Character Recognition (ACR) system is to simulate human reading abilities, so that the computer can read, comprehend, edit, and perform on texts activities similar to those that the human mind performs [17][62].
The process of ACR can be divided into two complementary sub-processes. The first handles a character's image after it has been introduced to the system, for example via scanning; this is described as offline recognition. In the second, the writer writes directly to the system, for example by using a light pen as the means of input; this is described as online recognition. The offline problem is commonly more difficult than the online problem, because more information is usually available in the latter. For example, pen movement can be used as a feature of the corresponding character [3][7][40][42][69].

The Arabic language is a universal language. It is the formal language of 25 countries whose overall population amounts to more than 300 million [5][39][42]. Moreover, Arabic characters are used by several other languages worldwide, such as Kurdish, Jawi, Iranian, and Urdu [69]. Nonetheless, the development of ACR systems has not received adequate attention from researchers compared with Japanese, Chinese, and Latin optical character recognition (OCR) systems [69]. For instance, Latin recognition began in 1940 [13], whereas the first attempt to recognize Arabic characters was carried out in 1975 [51]. The literature provides cumulative evidence that recognizing Arabic is much harder than recognizing the characters of many other languages, such as Chinese and Latin, because the characteristics of Arabic text are complex and the text is written cursively [5][18][20]. Additional details on the characteristics of Arabic writing are provided in [3][8][9][40][42][69].

The ideal system for ACR processes the image in five steps: acquisition, pre-processing, segmentation, feature extraction, and recognition (or classification) [3][40][42][50].
Each of these steps contributes to the ultimate recognition rate and to the enhancement of the OCR system [61][62]. The pre-processing stage in ACR is critical because it directly affects the reliability and efficiency of each of the three subsequent processes [8]. To enhance the performance of the ACR system, the pre-processing phase should include smoothing, noise elimination, image decomposition, skew detection and correction, baseline detection, and thinning [8]. Accordingly, this study focuses on the thinning of Arabic text.

This paper is organized as follows. Section 2 describes the thinning process of Arabic text. Section 3 clarifies the challenges that affect the thinning of Arabic text. Section 4 surveys the thinning methods applied to Arabic text. Finally, Section 5 provides the discussion and conclusion.

2. THINNING OF ARABIC TEXT
In image processing, the extraction of the skeleton of a digital image is widely used as a pre-processing stage for recognition and storage systems [65]. The skeleton of a character is its single-pixel-wide representation [49]. The main aim of the thinning process is to reduce a character whose strokes are several pixels thick to a single-pixel-wide shape that forms the basis of the character [55]. Thinning eliminates much unwanted information and hence reduces the memory required for storing the structural information [45]. The resulting skeleton usually consists of a set of lines, curves, and loops [7][12][10][11][30]. Thinning is used in many recognition applications, such as text, chromosome, and fingerprint analysis [43][45][41]. Thinning is very important in an ACR system: it simplifies the shapes of Arabic text for segmentation