Simultaneous Segmentation and Recognition of Arabic Printed Text Using Linguistic Concepts of Vocabulary

Mohamed Ben Halima 1 and Adel M. Alimi 1

1 The high school of National Engineering of Sfax B.P W.3038 Sfax-Tunisia

ABSTRACT

In this paper, we propose a new approach to the analysis and recognition of printed Arabic text. The approach is based on linguistic concepts of Arabic vocabulary. Within a text, we categorize words as decomposable (derived from a root) or indecomposable (not derived from a root), and we put forward morpho-syntactic characterization hypotheses for each word. For decomposable words, we attempt to recognize the basic morphemes of the word (antefix, prefix, infix, suffix, postfix and root), in contrast to existing approaches, which usually recognize the word as a whole entity using a holistic approach.

Keywords: Arabic Text, Segmentation, Recognition, Linguistic Concepts

1. INTRODUCTION

The automatic processing of the Arabic language is a field that raises industrial, economic, scientific and technical issues, and that also has a very specific cultural dimension. Despite the fact that almost three hundred million people worldwide use the Arabic alphabet, research on optical recognition of the Arabic script is not as advanced as research on other scripts (Latin, Japanese or Chinese).

Early work focused solely on the recognition of individual characters. The first work, published in 1975, dealt with printed script [1]. Until 1980, very little work addressed Arabic script, and the work of this period was limited to the recognition of isolated printed characters. In 1980, Amin proposed the first system for the recognition of handwritten characters as part of his IRAC system [2]. During the last two decades, research on the recognition of printed and handwritten Arabic script has progressed considerably [3] [4] [5] [6] [7] [8] [9] [10] [11]. However, much of this work still concentrates on isolated character recognition [12] [13].
In contrast, research on the recognition of words and texts is still relatively limited compared to the work performed for other scripts such as the Latin script. This delay, and the shortfall in research on the recognition of the Arabic script compared with other scripts, may be due mainly to a lack of common image databases of words and texts for evaluating existing systems; to a lack of dictionaries; possibly to the particular characteristics of Arabic characters and the cursive nature of both printed and handwritten script; and finally to insufficient exploitation of the richness of the language in terms of linguistic concepts useful in the recognition of words and texts.

2. INTEGRATION OF LINGUISTIC INFORMATION IN THE RECOGNITION PROCESS OF ARABIC WORDS

2.1 Problems of automated Arabic processing

One of the complexities of the Arabic language is the lack of vowels in the text, which can create ambiguity on two levels:

- the meaning of the word;
- the identification of its function in the sentence (differentiating between the subject and the complement…).

Document Recognition and Retrieval XVI, edited by Kathrin Berkner, Laurence Likforman-Sulem, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7247, 72470T · © 2009 SPIE-IS&T · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.805617
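
As a minimal illustration of the ambiguity described in Section 2.1 (and not a part of the proposed system), the following Python sketch shows two vocalized words that become the same character string once their vowel marks are removed. The example words (kataba, "he wrote", and kutub, "books") and the helper name strip_vowel_marks are assumptions chosen for illustration; they do not come from the paper.

```python
import unicodedata


def strip_vowel_marks(word: str) -> str:
    """Remove Arabic diacritics from a word (hypothetical helper).

    Assumption: every Unicode nonspacing mark (category "Mn") is treated
    as a diacritic. This covers the short-vowel marks (fatha, damma,
    kasra) but also removes other marks such as shadda and sukun.
    """
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Two fully vocalized words, written with explicit code points.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha: "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"          # kaf+damma, ta+damma, ba: "books"

# The unvocalized form normally found in printed text: kaf, ta, ba.
unvocalized = "\u0643\u062A\u0628"

# The vocalized forms differ, but both reduce to the same written form,
# so the written form alone does not determine the meaning.
assert kataba != kutub
assert strip_vowel_marks(kataba) == unvocalized
assert strip_vowel_marks(kutub) == unvocalized
```

In this sketch the verb and the noun are indistinguishable after stripping, which corresponds to both levels of ambiguity listed above: the meaning of the word and its function in the sentence.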