BRIEF COMMUNICATION
Natural Language Processing: Word Recognition without
Segmentation
Khalid Saeed and Agnieszka Dardzińska
Computer Engineering Department, Faculty of Computer Science, Bialystok University of Technology,
Wiejska 45A, 15-351 Bialystok, Poland
In an earlier article on methods for recognizing
machine-written and handwritten cursive letters, we presented
a model showing that such scripts can be processed,
classified, and hence recognized as images. The
practical results we obtained encouraged us to extend
the theory to an algorithm for word recognition. In this
article, we introduce our ideas, describe our achievements,
and present the results of testing word recognition
without segmentation. This opens the
possibility of applying the methods used in this work,
together with other previously developed algorithms, to
process whole sentences and, hence, written and spoken
texts with the goal of automatic recognition.
Introduction
This article extends our earlier research (Saeed, 1998,
1999, 2000a, 2000b; Saeed & Dardzińska, 2000) on the
automatic recognition of handwritten and machine-written
cursive text in the Arabic alphabet (which is used not only
for Arabic but also for Farsi and a number of other languages).
When segmentation is used, the problem of overlapping
arises very often (Burr, 1998; Dehgham & Faez, 1998;
Ghuwar, 1997; Saeed, 2000a, 2000b). Because this
problem is difficult to overcome, it is important to treat
the word to be recognized as a single image wherever
possible. Doing so simplifies recognition and improves
its performance.
As an example of treating a word as an image, consider the
Arabic word بنت (pronounced bint, meaning "girl"). In its
thinned skeleton (Fig. 1), this word resembles, for example,
part of an electrocardiogram cycle or part of the waveform
outline of a spoken letter. Thinning was performed with the
algorithm given in Saeed & Niedzielski (1999) and its
modification, which produces an uninterrupted skeleton
(Saeed, 2001a; Saeed, Rybnik, & Tabedzki, 2001).
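
To make the notion of a thinned skeleton concrete, the following is a minimal sketch of a thinning pass in Python. It uses the widely known Zhang-Suen scheme as a stand-in and is not the algorithm of Saeed & Niedzielski (1999), whose details are not given here. The function names (`neighbours`, `transitions`, `zhang_suen_thin`) are hypothetical, and the sketch assumes a binary image stored as a list of lists in which 1 is ink, 0 is background, and the outermost rows and columns are background.

```python
def neighbours(img, r, c):
    """Return the 8 neighbours of (r, c) clockwise, starting from north."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def transitions(ring):
    """Count 0 -> 1 transitions while walking once around the neighbours."""
    closed = ring + ring[:1]
    return sum(1 for a, b in zip(closed, closed[1:]) if a == 0 and b == 1)


def zhang_suen_thin(image):
    """Thin a binary image (1 = ink) with the Zhang-Suen scheme.

    Assumption: the outermost rows and columns contain no ink.
    The input is not modified; a thinned copy is returned.
    """
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # two sub-iterations per pass
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    ring = neighbours(img, r, c)
                    north, east, south, west = ring[0], ring[2], ring[4], ring[6]
                    if not 2 <= sum(ring) <= 6:
                        continue
                    if transitions(ring) != 1:
                        continue
                    if step == 0:
                        removable = (north * east * south == 0
                                     and east * south * west == 0)
                    else:
                        removable = (north * east * west == 0
                                     and north * south * west == 0)
                    if removable:
                        to_clear.append((r, c))
            # Delete the marked pixels only after the whole scan,
            # so every decision in a sub-iteration sees the same image.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# Usage: a bar three pixels thick is reduced to a one-pixel-wide stroke.
bar = [[1 if 2 <= r <= 4 and 2 <= c <= 9 else 0 for c in range(12)]
       for r in range(7)]
skeleton = zhang_suen_thin(bar)
for row in skeleton:
    print("".join("#" if v else "." for v in row))
```

A scheme of this kind does not by itself guarantee the uninterrupted skeleton mentioned above; that property belongs to the authors' modified algorithm.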
In fact, this resemblance allows us to extend the algorithm
to cover images other than written texts. Most classical
methods of classification require word segmentation as a
basic preprocessing stage. After segmentation, each letter
is treated as a separate image. Many algorithms exist for script
recognition (Burr, 1998; Delgham & Faez, 1998; Ghuwar,
1999; Saeed, 2000b; Saeed & Niedzielski, 1999; Saeed et
al., 2001). In this work, however, we present a way of
recognizing written texts by classifying their words or
subwords without separating the letters. The
basic approaches and methods followed here are the ideas
and algorithms for script recognition developed by the
authors in their previous work (Saeed, 1998, 2000b, 2001b;
Saeed & Dardzińska, 2000). Figure 2 shows the same word
segmented into its separate letters.
The problem of recognizing Arabic script, however, differs
from that of recognizing cursive script in English, because
changing the position or the number of dots in
a given letter produces a different word. Figure 3
shows the same word segmentation, but this time for the
handwritten word. As can be seen, the word has three letters,
each of which has one or two dots. These letters may be
joined in different combinations to give other words with
completely different meanings, as will be shown in the
Experiments and Future Aspects section. For instance, the
Received November 30, 2000; revised March 19, 2001; accepted June 8,
2001
© 2001 John Wiley & Sons, Inc.
Published online 25 October 2001 ● DOI: 10.1002/asi.1192
FIG. 1. The machine-written word and its thinned shape.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 52(14):1275–1279, 2001