Australian Journal of Basic and Applied Sciences, 3(4): 4160-4169, 2009
ISSN 1991-8178

Character Segmentation of Sindhi, an Arabic Style Scripting Language, using Height Profile Vector

Noor Ahmed Shaikh¹, Ghulam Ali Mallah¹, Zubair A. Shaikh²

¹ Assistant Professor and PhD student, Shah A. Latif University, Khairpur, Sindh, Pakistan
² Professor and Director, FAST-NU, Karachi, Sindh, Pakistan, zubair.shaikh@nu.edu.pk

Corresponding Author: Noor Ahmed Shaikh, Assistant Professor and PhD student, Shah A. Latif University, Khairpur, Sindh, Pakistan. E-mail: noor.shaikh@salu.edu.pk

Abstract: This paper addresses the segmentation of printed Sindhi sub-words into characters. Sindhi is written in an Arabic-style script, and both printed and handwritten Sindhi text is cursive: consecutive characters in a word are mostly joined to each other. The proposed segmentation algorithm first calculates the Height Profile Vector (HPV) of the thinned primary stroke of a sub-word and analyzes it to segment the sub-word into its constituent characters. The number and locations of the possible segmentation points (PSPs) are then determined. The number of PSPs gives a rough estimate of the number of characters in the sub-word, and the data around the last PSP is analyzed further to determine the exact number. Because the Sindhi character set is a superset of the Arabic character set, the proposed algorithm may also be used to segment text written in other languages that use the Arabic script.

Key words: Sindhi OCR, Character Segmentation, Pattern Recognition.

INTRODUCTION

Character recognition is one of the most important fields of pattern recognition and has been around since the development of the first version of OCR in the 1950s (Mori, S., C.Y. Suen, 1992).
Since then, several character recognition systems have been proposed for English, Chinese, Japanese and other languages that use isolated characters (Badr, B.A. and S.A. Mahmoud, 1995). Character recognition systems for languages such as Arabic and Persian are not yet robust, and those for Sindhi and Urdu are still mostly in research labs, primarily because these languages are cursive in nature (Kavianafar, M. and A. Amin, 1999).

Recognizing unconstrained off-line cursive writing has proven to be a very difficult task, mainly due to the difficulty of character segmentation. Because of this difficulty, several attempts have been made to recognize sub-words instead of characters (Liying Zheng, 2006; Mandana Kavianifar and Adnan Amin, 1999; Somaya Alma’adeed, 2006). This approach can only narrow down the sub-word candidates, because in a large vocabulary several sub-words may have the same global shape (Yannikoglu, B. and P.A. Sandon, 1998).

In this paper, we address the segmentation of off-line printed Sindhi sub-words into characters. Sindhi uses the Arabic script, so techniques used for the segmentation of Arabic script may be used for Sindhi script and vice versa. Because segmenting Arabic script into characters is difficult, many segmentation systems do not segment into characters but into other units or parts that are easier to segment. For example, Elgammal et al. (2001) segmented words into small connected segments called ‘scripts’.

The organization of this paper is as follows: Section 2 presents an introduction to Sindhi. Section 3 presents the related work on Arabic character segmentation. Section 4 presents the proposed method for the segmentation of Sindhi characters. Sections 5, 6 and 7 present the conclusions, future work and acknowledgements, respectively.

2. Introduction to Sindhi:

Sindhi is an Indo-Aryan language with roots in the Lower Indus River Valley. The name is derived from Sindhu, the ancient name of the river Indus. Sindhi is the third major spoken language of Pakistan, and over 30 million people in Pakistan and India speak it. Beyond the Indian sub-continent, it is also spoken