Simultaneous Segmentation and Recognition of Arabic Printed Text
Using Linguistic Concepts of Vocabulary
Mohamed Ben Halima¹ and Adel M. Alimi¹
¹ The high school of National Engineering of Sfax
B.P W.3038 Sfax-Tunisia
ABSTRACT
In this paper, we propose a new approach to the analysis and recognition of printed Arabic text. The approach is based on
linguistic concepts of Arabic vocabulary. Given a text, we categorize its words into decomposable words (derived
from a root) and indecomposable words (not derived from a root), and we put forth morpho-syntactic characterization
hypotheses for each word. For decomposable words, we attempt to recognize the basic morphemes of the word (antefix, prefix,
infix, suffix, postfix and root), in contrast to existing approaches, which usually recognize the word as a whole entity
using a holistic approach.
Keywords: Arabic Text, Segmentation, Recognition, Linguistic Concepts
1. INTRODUCTION
The automatic processing of the Arabic language is a field that raises industrial, economic, scientific and technical
issues, and that also has a very specific cultural dimension.
Despite the fact that almost three hundred million people worldwide use the Arabic alphabet, research on optical
recognition of the Arabic script is not as advanced as for other scripts (Latin, Japanese or Chinese). Early work focused
solely on the recognition of individual characters. The first work, published in 1975, dealt with printed script [1].
Until 1980, very little work dealt with Arabic script, and the work of that period was limited to the recognition of
isolated printed characters. In 1980, Amin proposed the first system for the recognition of handwritten characters,
IRAC [2].
During the last two decades, research on the recognition of printed and handwritten Arabic script has
progressed considerably [3] [4] [5] [6] [7] [8] [9] [10] [11]. However, much work still addresses isolated character
recognition [12] [13]. In contrast, research on the recognition of words and texts is still relatively limited compared to
the work performed for other scripts such as the Latin script.
This delay, and the shortfall in research on Arabic script recognition compared with other scripts,
may be due mainly to the lack of common image databases of words and texts for evaluating existing systems and to the lack
of dictionaries; perhaps also to the particular characteristics of Arabic characters and the cursive nature of both printed
and handwritten script; and finally to insufficient exploitation of the richness of the language in terms of linguistic
concepts useful for the recognition of words and texts.
2. INTEGRATION OF LINGUISTIC INFORMATION IN THE RECOGNITION PROCESS
OF ARABIC WORDS
2.1 Problems of automated Arabic processing
One of the complexities of the Arabic language is the absence of vowels in written text, which can create
ambiguity on two levels:
The meaning of the word
The function of the word in the sentence (for example, distinguishing the subject from the complement).
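As an illustration of this ambiguity, the following minimal Python sketch (not part of the proposed system) removes the short-vowel marks from three distinct vocalized words written with the letters k-t-b and checks that they collapse to the same unvocalized string. The function name strip_diacritics and the three sample words are assumptions chosen for this example; the words are written with Unicode escapes so that the listing does not depend on the encoding of the file.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Drop every Unicode combining mark, which includes the Arabic
    short-vowel signs (fatha, damma, kasra, tanwin, shadda, sukun)."""
    return "".join(ch for ch in word if unicodedata.combining(ch) == 0)


# Hypothetical sample: three different vocalized words on the letters k-t-b.
VOCALIZED = {
    "kataba (he wrote)": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kutiba (it was written)": "\u0643\u064F\u062A\u0650\u0628\u064E",
    "kutubun (books)": "\u0643\u064F\u062A\u064F\u0628\u064C",
}

# Set of distinct strings that remain once the vowel marks are removed.
unvocalized_forms = {strip_diacritics(w) for w in VOCALIZED.values()}

if __name__ == "__main__":
    for gloss, word in VOCALIZED.items():
        print(f"{gloss}: {word} -> {strip_diacritics(word)}")
    print("distinct unvocalized forms:", len(unvocalized_forms))
```

A verb in the active voice, a verb in the passive voice and a plural noun become indistinguishable in unvocalized text, so both the meaning of the word and its function in the sentence have to be recovered from context.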
Document Recognition and Retrieval XVI, edited by Kathrin Berkner, Laurence Likforman-Sulem, Proc. of SPIE-IS&T
Electronic Imaging, SPIE Vol. 7247, 72470T · © 2009 SPIE-IS&T · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.805617