American Journal of Computing Research Repository, 2014, Vol. 2, No. 1, 1-7
Available online at http://pubs.sciepub.com/ajcrr/2/1/1
© Science and Education Publishing
DOI:10.12691/ajcrr-2-1-1
Word Segmentation Model for Sindhi Text
Zeeshan Bhatti*, Imdad Ali Ismaili, Waseem Javaid Soomro, Dil Nawaz Hakro
Institute of Information and Communication Technology, University of Sindh, Jamshoro
*Corresponding author: zeeshan.bhatti@usindh.edu.pk
Received November 25, 2013; Revised December 15, 2013; Accepted January 01, 2014
Abstract This research addresses the problem of Sindhi word segmentation and discusses various techniques for
solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP):
for a system to understand written text, it must be able to break that text into individual tokens for processing.
Sindhi, written in a cursive, ligature-based Perso-Arabic script, is complex and rich, with a large number of
characters, each having multiple glyphs depending on its position in the text. This paper proposes a Sindhi word
tokenization model and implements various algorithms that tokenize Sindhi text into individual words for corpus
building and for creating a word repository for a Sindhi spell checker, grammar checker, and other NLP
applications. Tokenization is performed by first identifying the sentence boundaries and extracting the sentences
into a list in which each element is a complete sentence. The separated sentences are then broken into words,
with the hard space character used as the word boundary; soft spaces are considered part of a word and are
therefore ignored during segmentation. Finally, each word is filtered to remove special characters and, after
validation, saved as a token.
Keywords: word segmentation, Sindhi tokenization, Sindhi language, Sindhi spell checker
Cite This Article: Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro, and Dil Nawaz Hakro,
“Word Segmentation Model for Sindhi Text.” American Journal of Computing Research Repository 2, no. 1
(2014): 1-7. doi: 10.12691/ajcrr-2-1-1.
1. Introduction
The process of segregating a sentence into individual
word tokens is termed word segmentation or
tokenization [1]. In Natural Language Processing (NLP),
tokenization is regarded as the most fundamental task [2].
Almost every NLP application, for example Machine
Translation (MT) and spell checking, must at some stage
break its text into individual tokens for processing [2,3].
In languages like English, where punctuation marks or
white spaces separate words, tokenization is done by
identifying these word boundaries [3]. The scanning
routines usually include various algorithms for handling
morphology in a language-dependent manner. Even for a
language like English, which is very lightly inflected,
contractions and possessives need to be handled within
the word extraction routines [4,5]. Sindhi, like other
Asian languages such as Urdu, Arabic, and Persian,
faces the same text segmentation problem, with space
omission and insertion issues.
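As a minimal illustration of boundary-based tokenization for English, the following Python sketch splits text on white space and punctuation while keeping contractions and possessives as single tokens. The function name `tokenize_english`, the regular expression, and the policy of keeping an apostrophe suffix attached to its word are assumptions made for this example; they are not taken from the cited works.

```python
import re

# A word is a run of letters or digits, optionally followed by an
# apostrophe suffix (contraction or possessive). Any other
# non-space character is emitted as a one-character token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize_english(text):
    """Return the word and punctuation tokens found in `text`."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize_english("Don't touch John's book."))
```

A different policy, such as splitting "John's" into "John" and "'s", would need only a change to the pattern; which choice is appropriate depends on the downstream application.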
Sindhi is an official language of the Sindh province of
Pakistan and is spoken by approximately 34.4 million
people in Pakistan and around 2.8 million people in India
[6]. The Sindhi script is based on the Perso-Arabic script
and is written in the Arabic Naskh style, from right to
left, with a cursive ligature system [6]. In its written
form, subsequent characters in a word are joined with
each other, as shown in Figure 1. Its cursive nature and
its Aerabs (diacritical marks) make Sindhi text difficult
to process in NLP applications. For any NLP application
it is vital that a standard corpus of the language is built
so that text can be processed and compared using
statistical analysis [7]. Therefore, the need for a formal
Sindhi corpus is evident, and a model is needed for the
tokenization of Sindhi words.
This paper discusses a Sindhi word segmentation
technique for developing a Sindhi corpus and
tokenizing Sindhi text, in order to build a repository of
Sindhi words for NLP applications such as spell checkers.
Sindhi word boundaries within the text are identified by
finding the hard space character. Sindhi is a very
complex language: its script has fifty-two characters,
and each character has separate glyph shapes depending
on its position in a string. This creates ambiguity in the
Sindhi script, as the language contains two types of
letters: connectors and non-connectors. A Sindhi word
may therefore use soft space as well as hard space
characters, as shown in Figure 1.
Figure 1. Sindhi text
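The segmentation steps outlined so far (sentence extraction, splitting on the hard space, keeping soft spaces inside words, and filtering special characters) can be sketched in Python as follows. This is a rough sketch under stated assumptions, not the authors' implementation: the soft space is assumed to be the zero-width non-joiner (U+200C), the hard space is assumed to be the ordinary space (U+0020), the set of sentence terminators is assumed, and all function names are hypothetical. Validation is reduced here to discarding empty tokens.

```python
import re
import unicodedata

HARD_SPACE = "\u0020"  # assumption: ordinary space marks a word boundary
SOFT_SPACE = "\u200c"  # assumption: zero-width non-joiner acts as soft space
# Assumed sentence terminators: ".", "!", Arabic question mark, Arabic full stop.
SENTENCE_END = ".!\u061f\u06d4"


def split_sentences(text):
    """Split text into a list of sentences at the assumed terminators."""
    parts = re.split("[" + re.escape(SENTENCE_END) + "]+", text)
    return [p.strip() for p in parts if p.strip()]


def clean_word(word):
    """Keep letters, combining marks (diacritics), and soft spaces only."""
    return "".join(
        ch for ch in word
        if ch == SOFT_SPACE or unicodedata.category(ch)[0] in ("L", "M")
    )


def tokenize(text):
    """Return the word tokens of `text`, sentence by sentence."""
    tokens = []
    for sentence in split_sentences(text):
        for raw in sentence.split(HARD_SPACE):
            word = clean_word(raw)
            if word:  # minimal stand-in for the validation step
                tokens.append(word)
    return tokens


if __name__ == "__main__":
    # Two short sentences written with Unicode escapes; the first contains
    # an Arabic comma (U+060C) that the filter removes.
    sample = ("\u0633\u0646\u068c\u064a \u067b\u0648\u0644\u064a\u060c "
              "\u0622\u0647\u064a\u06d4 \u0647\u064a \u0633\u067a\u0648 "
              "\u0622\u0647\u064a\u061f")
    for token in tokenize(sample):
        print(token)
```

Because the soft space is never treated as a boundary, a word whose parts are separated only by U+200C stays a single token, which matches the behavior described above for soft spaces.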