American Journal of Computing Research Repository, 2014, Vol. 2, No. 1, 1-7
Available online at http://pubs.sciepub.com/ajcrr/2/1/1
© Science and Education Publishing
DOI:10.12691/ajcrr-2-1-1

Word Segmentation Model for Sindhi Text

Zeeshan Bhatti*, Imdad Ali Ismaili, Waseem Javaid Soomro, Dil Nawaz Hakro
Institute of Information and Communication Technology, University of Sindh, Jamshoro
*Corresponding author: zeeshan.bhatti@usindh.edu.pk

Received November 25, 2013; Revised December 15, 2013; Accepted January 01, 2014

Abstract This research addresses the problem of Sindhi word segmentation and discusses various techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP). For a system to understand written text, it must be able to break the text into individual tokens for processing. Sindhi, written in a cursive, ligature-based Perso-Arabic script, is complex and rich: its script has a large number of characters, and each character has multiple glyphs depending on its position in the text. This paper proposes a Sindhi word tokenization model, implemented through several algorithms, that tokenizes Sindhi text into individual words for corpus building and for creating a word repository for a Sindhi spell checker, grammar checker, and other NLP applications. Tokenization is performed by first identifying sentence boundaries and extracting the sentences into a list in which each element is one complete sentence. The separated sentences are then broken into words, with the hard space character serving as the word boundary; soft spaces are treated as part of a word and are therefore ignored during segmentation. Finally, each word is filtered to remove special characters, validated, and saved as a token.
Keywords: word segmentation, Sindhi tokenization, Sindhi language, Sindhi spell checker

Cite This Article: Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro, and Dil Nawaz Hakro, “Word Segmentation Model for Sindhi Text.” American Journal of Computing Research Repository 2, no. 1 (2014): 1-7. doi: 10.12691/ajcrr-2-1-1.

1. Introduction

The process of separating a sentence into individual word tokens is termed word segmentation or tokenization [1]. In Natural Language Processing (NLP), tokenization is regarded as the most fundamental task [2]. Almost every NLP application must at some stage break its text into individual tokens for processing, for example in Machine Translation (MT) and spell checking [2,3]. In languages like English, where punctuation marks or white spaces separate words, tokenization is done by identifying word boundaries [3]. The scanning routines usually include various algorithms for handling morphology in a language-dependent manner. Even for a language like English, which is very lightly inflected, contractions and possessives must be handled within the word extraction routines [4,5].

Sindhi, like other Asian languages such as Urdu, Arabic, and Persian, faces the same text segmentation problem, with space omission and insertion issues. Sindhi is an official state language of Sindh province in Pakistan and is spoken by approximately 34.4 million people in Pakistan and around 2.8 million people in India [6]. The Sindhi script is based on the Perso-Arabic script, in the Arabic Naskh style of writing, running right to left with a cursive ligature system [6]. In its written form the script is cursive: subsequent characters in a word are joined with each other, as shown in Figure 1.
The cursive nature of the script and its Aerabs (diacritic marks) make Sindhi text difficult to process in NLP applications. For any NLP application, it is vital that a standard corpus of the language be built so that text can be processed and compared using statistical analysis [7]. Therefore, the need for a formal Sindhi corpus is evident, and a model is needed for the tokenization of Sindhi words. This paper discusses a Sindhi word segmentation technique for developing a Sindhi corpus and tokenizing Sindhi text, in order to build a repository of Sindhi words for NLP applications such as spell checkers. Sindhi word boundaries within the text are identified by finding the hard space character. Sindhi, being a very complex language, possesses fifty-two characters in its script, and each character has separate glyph shapes depending on its position in a string. This creates ambiguity in the Sindhi script, because the Sindhi language contains two types of letters: connectors and non-connectors. A Sindhi word therefore uses soft space as well as hard space characters, as shown in Figure 1.

Figure 1. Sindhi text
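
The segmentation steps outlined so far (splitting text into sentences, splitting sentences into words on the hard space, and filtering special characters) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions that the text does not state: the sentence terminators are taken to be the Arabic full stop (U+06D4), the Arabic question mark (U+061F), and the ASCII marks ".", "!" and "?"; the hard space is taken to be U+0020; and the soft space is represented by the zero-width non-joiner (U+200C). The function names `split_sentences`, `clean_word`, and `tokenize` are hypothetical.

```python
import re
import unicodedata

# Assumed sentence terminators (not specified in the paper):
# Arabic full stop, Arabic question mark, and ASCII '.', '!', '?'.
SENTENCE_END = "\u06D4\u061F.!?"
HARD_SPACE = "\u0020"  # ordinary space, used as the word boundary
SOFT_SPACE = "\u200C"  # zero-width non-joiner, assumed stand-in for the soft space


def split_sentences(text):
    """Return a list in which each element is one sentence."""
    parts = re.split("[" + re.escape(SENTENCE_END) + "]+", text)
    return [p.strip() for p in parts if p.strip()]


def clean_word(word):
    """Keep letters, combining marks (diacritics), digits, and soft spaces;
    drop punctuation and other special characters."""
    return "".join(
        ch for ch in word
        if ch == SOFT_SPACE or unicodedata.category(ch)[0] in "LMN"
    )


def tokenize(text):
    """Sentence split, then hard-space split, then special-character filter."""
    tokens = []
    for sentence in split_sentences(text):
        for raw in sentence.split(HARD_SPACE):
            word = clean_word(raw)
            if word:  # validation step: discard empty results
                tokens.append(word)
    return tokens


if __name__ == "__main__":
    for token in tokenize("سنڌي ٻولي۔ هي ڪتاب آهي؟"):
        print(token)
```

Because the soft space is kept inside `clean_word` rather than used as a split point, a word containing it stays a single token, which mirrors the rule that only hard spaces mark word boundaries. A production tokenizer would also need the validation and space omission handling that this sketch leaves out.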