A New Dataset of Persian Handwritten Documents and its Segmentation

Alireza Alaei
Department of Studies in Computer Science, University of Mysore, Mysore, 570006, India
alireza20alaei@yahoo.com

P. Nagabhushan
Department of Studies in Computer Science, University of Mysore, Mysore, 570006, India
pnagabhushan@hotmail.com

Umapada Pal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata–108, India
umapada@isical.ac.in

Abstract: In document image analysis, and especially in handwritten document image recognition, standard datasets play a vital role in evaluating the performance of algorithms and in comparing results obtained by different groups of researchers. In this paper, an unconstrained Persian handwritten text dataset (PHTD) is introduced. The PHTD contains 140 handwritten documents of three different categories written by 40 individuals. The total numbers of text-lines and words/subwords in the dataset are 1787 and 27073, respectively. Most of the PHTD documents contain overlapping or touching text-lines. The average number of text-lines per document in the PHTD is 13. Two types of ground truth, based on pixel information and content information, are generated for the dataset. With these two types of ground truth, the PHTD can be utilized in many areas of document image processing such as sentence recognition/understanding, text-line segmentation, word segmentation, word recognition, and character segmentation. To provide a point of comparison for other researchers, recent text-line segmentation results on this dataset are also reported.

Keywords: Handwritten document; Persian handwritten recognition; Persian handwritten dataset; Ground truth

I. INTRODUCTION

Many researchers have been working on the recognition of handwritten documents, and many standard datasets have been developed to help researchers share results on the same data and compare the performance of their algorithms [1–17].
For some scripts such as Roman, Chinese, and Korean, many standard datasets are available in the literature [1–9]. NIST [1], MNIST [2], CENPARMI [3], and CEDAR [4] are some examples of widely used English datasets for numerals, characters, words and text-pages. The NIST standard dataset SD3 [1] contains one of the largest collections of isolated digits and characters. Altogether, it contains over 300000 character images that were extracted from data collection forms filled out by 2100 individuals. The images in SD3 were provided as a training set. A separate dataset (TD1) was developed for testing. The quality of the data varies more widely in TD1 than in SD3. A significant amount of running English text is also available in the NIST SD3 and TD1 datasets. Three other NIST datasets (SD11–SD13) contain examples of phrases. Altogether, 91500 handwritten phrases were scanned at 200 DPI in binary mode. The CENPARMI dataset contains 17000 isolated digits that were extracted from images of about 3400 postal ZIP Codes [3]. The isolated digits were extracted from the ZIP Codes by manual segmentation. The CEDAR dataset, another available English dataset, contains nearly 28000 isolated handwritten characters and digits that were extracted from images of US postal addresses [4]. Individual characters and digits were segmented from the address images by a semi-automatic process. Approximately 5000 ZIP Code, 5000 city name, and 9000 state name images are also included in [4].

Despite the availability of many datasets for isolated characters and digits, to the best of our knowledge only a few datasets are available for handwritten text-lines [5–7]. Recently, a benchmark dataset of 300 text-pages was made available for text-line and word segmentation [5]. The ground truths for text-line and word segmentation were manually annotated. This dataset was used for the ICDAR2009 segmentation contest.
One hundred document images and their associated ground truths formed the training set, and 200 images and their associated ground truths formed the test set. The documents used to build the training and test sets did not include any non-text elements and were written by 50 individuals in English, French, German and Greek [5]. Because datasets in English are easily available, a wide range of research, from numeral/character recognition to sentence recognition using natural language processing [20], is available in the literature.

For Indian languages, only a few standard datasets are available, although India is a multilingual, multi-script country. Recently, researchers have provided some datasets for off-line handwritten Bangla and Devanagari numerals, characters, and text-pages [10–12]. The database introduced in [12] contains 150 handwritten document pages, of which 100 pages are written purely in Bangla script and the remaining 50 pages are written in Bangla text mixed with English words.

For Persian and Arabic, there exist a few handwritten datasets for characters, numerals, words, and text-lines [13–19]. Among these, only the dataset introduced in [19] contains Persian texts. This dataset consists of 1000 filled forms written by 250 individuals and can be used