BRIEF COMMUNICATION

Natural Language Processing: Word Recognition without Segmentation

Khalid Saeed and Agnieszka Dardzińska

Computer Engineering Department, Faculty of Computer Science, Bialystok University of Technology, Wiejska 45A, 15-351 Bialystok, Poland

In an earlier article on methods for recognizing machine- and hand-written cursive letters, we presented a model showing that such scripts can be processed, classified, and hence recognized as images. The practical results we obtained encouraged us to extend the theory to an algorithm for word recognition. In this article, we introduce our ideas, describe our achievements, and present the results of testing words for recognition without segmentation. The methods used in this work, together with other previously developed algorithms, could then be applied to process whole sentences and, hence, written and spoken texts with the goal of automatic recognition.

Introduction

This article extends our earlier research (Saeed, 1998, 1999, 2000a, 2000b; Saeed & Dardzińska, 2000) on the automatic recognition of hand- and machine-written cursive text using the Arabic alphabet (which is used not only for Arabic but also for Farsi and a number of other languages). When segmentation is considered, the problem of overlapping arises very often (Burr, 1998; Dehgham & Faez, 1998; Ghuwar, 1997; Saeed, 2000a, 2000b). Because this significant problem is difficult to overcome in language recognition, it is important to treat the word to be recognized as a single image as far as possible. Doing so simplifies recognition and improves its performance.

As an image simulation example, consider the Arabic word بنت (pronounced bint, meaning a girl). In its thinned skeleton (Fig. 1), this word resembles part of an electrocardiogram cycle image, for example, or part of the waveform outline of a spoken letter, and so on.
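To make the notion of a thinned skeleton concrete, the sketch below applies the classic Zhang-Suen thinning algorithm to a small binary image. It is a stand-in chosen for brevity and is not the thinning method cited in this article; the function name `zhang_suen_thin` and the list-of-rows image format (1 for ink, 0 for background) are assumptions made only for this illustration.

```python
def zhang_suen_thin(image):
    """Thin a binary image (list of rows of 0/1) with the Zhang-Suen algorithm.

    Illustrative stand-in only; not the thinning method cited in the article.
    """
    img = [list(row) for row in image]  # work on a copy, leave the input intact
    rows, cols = len(img), len(img[0])
    # Offsets of neighbours P2..P9, clockwise starting from the pixel above.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

    def neighbours(r, c):
        # Pixels outside the image are treated as background (0).
        return [
            img[r + dr][c + dc] if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
            for dr, dc in offsets
        ]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of each pass
            to_clear = []
            for r in range(rows):
                for c in range(cols):
                    if img[r][c] != 1:
                        continue
                    p = neighbours(r, c)
                    b = sum(p)  # number of ink neighbours
                    # Number of 0 -> 1 transitions around the pixel (P9 wraps to P2).
                    a = sum(1 for i in range(8) if p[i] == 0 and p[(i + 1) % 8] == 1)
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((r, c))
            # Delete flagged pixels only after the whole scan (parallel update).
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# Hypothetical input: a stroke three pixels thick on a 5 x 7 canvas.
thick_stroke = [[0] * 7] + [[0, 1, 1, 1, 1, 1, 0] for _ in range(3)] + [[0] * 7]
for row in zhang_suen_thin(thick_stroke):
    print("".join("#" if px else "." for px in row))
```

A stroke that is already one pixel wide passes through unchanged, while thicker strokes lose boundary pixels pass by pass until no more can be removed.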
The thinning in Fig. 1 follows the algorithm given in Saeed & Niedzielski (1999) and its modification to produce an uninterrupted skeleton (Saeed, 2001a; Saeed, Rybnik, & Tabedzki, 2001). In fact, this allows us to extend the algorithm to cover images other than written texts.

FIG. 1. The machine-written word and its thinned shape.

Most classical methods of classification require word segmentation as a basic preprocessing stage. After segmentation, each letter is treated as a separate image. There exist many algorithms for script recognition (Burr, 1998; Delgham & Faez, 1998; Ghuwar, 1999; Saeed, 2000b; Saeed & Niedzielski, 1999; Saeed et al., 2001). In this work, however, we present a way of recognizing written texts by classifying their words or subwords without separating the letters. The basic approaches and methods followed here are the ideas and algorithms for script recognition developed by the authors in their previous work (Saeed, 1998, 2000b, 2001b; Saeed & Dardzińska, 2000).

Received November 30, 2000; revised March 19, 2001; accepted June 8, 2001
© 2001 John Wiley & Sons, Inc. Published online 25 October 2001. DOI: 10.1002/asi.1192
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 52(14):1275–1279, 2001

Figure 2 shows the word segmented into its separate letters. Figure 3 shows the same segmentation, but this time for the handwritten word. The problem of recognizing Arabic script differs from that of recognizing cursive script in English, because changing the position or the number of dots in a given letter produces a different word. As can be seen, the word has three letters, each of which has one or two dots. These letters may be joined in different combinations to give other words of completely different meanings, as will be shown in the Experiments and Future Aspects section. For instance, the