Abstract: This paper discusses the characteristics of the Urdu script and of Urdu Nastaleeq, and presents a simple but novel and robust technique for recognizing printed Urdu script without a lexicon. Urdu, a member of the Arabic script family, is cursive and complex by nature. The main complexity of Urdu compound (connected) text lies not in its connections but in the forms a character takes when it is placed at the initial, middle, or final position of a word. The character recognition technique presented here uses this inherent complexity of Urdu script to solve the problem. A word is scanned and analyzed for its level of complexity; the point where the level of complexity changes is marked as a character boundary, and the character is segmented and fed to neural networks. A prototype of the system has been tested on Urdu text and currently achieves 93.4% accuracy on average.

Keywords: Cursive Script, OCR, Urdu.

I. INTRODUCTION
Urdu, the national language of Pakistan, is spoken by more than 60 million speakers in over 20 countries [2]. It is a cursive script, written from right to left like Arabic and Farsi but with some additional letters; therefore OCR systems built for Arabic or Farsi do not suit the needs of Urdu script. In this paper a character is segmented using a three-step approach: first, lines of text are identified; second, words are identified; and third, each character is segmented and extracted from a word or sub-word using its complexity level, to be fed to a neural network for final recognition/classification. The main focus of the paper is character segmentation and extraction from a word or sub-word; the identification of text lines and words, and the neural networks used for character segmentation and identification, are not described in detail.

II.
URDU SCRIPT
Urdu is one of the popular scripts of the Indian subcontinent and the national language of Pakistan. It evolved in the subcontinent from a mixture of the Arabic, Turkish, Farsi, and Hindi languages, and has a 58-character set defined by the National Language Authority, Pakistan, as shown in Fig. 1. However, only 40 basic letters and one do-chashmi-hey are used to form all composite letters, giving a total of 41 letters. Urdu shares a common script and many characteristics with Arabic, with an additional set of letters.

Fig. 1 Character Set (58 alphabets) of Urdu Script

Most Urdu characters, when combined, form an angle of about 45 degrees to the horizontal line. Because of this, Urdu script can be read faster than Roman script, but it is harder for novice readers and for machines to recognize a word or to segment one character from the rest. Unlike English script, Urdu has no capital or small characters. However, the last character of a word can be regarded as a capital, since in many cases it presents the full form of the character, while the characters at the initial and middle positions can be regarded as small. Every character has a standalone shape besides its different joining forms, but some letters, such as those that make up the word Urdu (اردو) and others of the same category, are not joinable and cannot be connected.

The Urdu alphabet uses consonant letters, vowels, diacritic marks, numerals, punctuation, and a few superscript signs. The graphic representation of each letter has more than one form, depending on its position and context in the word. In general each letter has four forms, namely beginning, middle, final, and standalone, as shown in Table I.

The authors are with the Center for Computing, Institute of Management Sciences, Peshawar, Pakistan.
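The four positional forms described above can be illustrated with a minimal sketch. It assumes Unicode's generic Arabic presentation-form code points as stand-ins for the shapes (these are not Nastaleeq-specific glyphs), and it assumes every letter joins on both sides, which does not hold for the non-joinable letters mentioned earlier. The names `ISOLATED`, `OFFSETS`, `positional_form`, and `position_in_word` are hypothetical and are not part of the system described in this paper.

```python
import unicodedata

# Code points of the standalone (isolated) presentation forms of three letters.
# For these letters Unicode stores the four forms consecutively in the order
# isolated, final, initial, medial.
ISOLATED = {"be": 0xFE8F, "pe": 0xFB56, "te": 0xFE95}
OFFSETS = {"standalone": 0, "final": 1, "beginning": 2, "middle": 3}


def position_in_word(index: int, length: int) -> str:
    """Name the position of the letter at `index` in a word of `length` letters.

    Simplifying assumption: every letter joins on both sides.
    """
    if length == 1:
        return "standalone"
    if index == 0:
        return "beginning"
    if index == length - 1:
        return "final"
    return "middle"


def positional_form(letter: str, position: str) -> str:
    """Return the presentation-form character of `letter` at `position`."""
    return chr(ISOLATED[letter] + OFFSETS[position])


if __name__ == "__main__":
    # Print the form each letter of a hypothetical three-letter word would take.
    word = ["be", "te", "pe"]
    for i, letter in enumerate(word):
        pos = position_in_word(i, len(word))
        print(letter, pos, unicodedata.name(positional_form(letter, pos)))
```

A recognizer that treats each positional form as its own class would need one such entry per letter and position, which is why the number of shapes to learn is several times the number of letters.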
TABLE I
CHARACTERS AND THEIR DIFFERENT FORMS

# | Char (حرف) | Forms (اشکال) | Name (اسم) | Name
0 | ء | | ہمزہ | hamzah
1 | ا | ا | الف | alif
1a | | | الف مد | alif madd
2 | ب | ببب | بے | bē
2h | بھ | بھ | بھے | bhē
3 | پ | پپپ | پے | pē
3h | پھ | پھ | پھے | phē
4 | ت | تتت | تے | tē
4h | تھ | تھ | تھے | thē

Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan. Urdu Nastaleeq Optical Character Recognition. World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol:1, No:8, 2007.
International Scholarly and Scientific Research & Innovation 1(8) 2007, scholar.waset.org/1307-6892/1702
International Science Index, Computer and Information Engineering, waset.org/Publication/1702
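As a rough illustration of the first two steps of the three-step approach outlined in the Introduction (identifying text lines, then the ink-bearing segments within a line), the sketch below uses projection profiles on a binarized image. The paper does not describe how lines and words are identified, so the projection-profile method, the names `runs`, `find_lines`, and `find_segments`, and the toy image are all assumptions made for illustration. The complexity-based character segmentation of the third step is not attempted here.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binarized input: 1 = ink pixel, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ended at the previous index
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def find_lines(image: Image) -> List[Tuple[int, int]]:
    """Step 1 (assumed method): row bands whose horizontal projection is non-zero."""
    return runs([sum(row) for row in image])


def find_segments(image: Image, top: int, bottom: int) -> List[Tuple[int, int]]:
    """Step 2 (assumed method): ink-bearing column bands of one line, right to left."""
    width = len(image[0])
    profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return list(reversed(runs(profile)))   # Urdu is read from right to left


# Toy 5x7 image with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1],
]

if __name__ == "__main__":
    for top, bottom in find_lines(page):
        print((top, bottom), find_segments(page, top, bottom))
```

In this sketch any blank column splits a segment. A real system would need a gap threshold to tell the spaces between words from the smaller gaps between the sub-words of one word.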