A Database for Arabic Handwritten Text Image Recognition and Writer Identification

Anis Mezghani, Slim Kanoun, Maher Khemakhem
University of Sfax, MIRACL Lab, ISIMS
Sfax, Tunisia
{anis.mezghani, slim.kanoun}@gmail.com, maher.khemakhem@fsegs.rnu.tn

Haikal El Abed
Braunschweig Technical University, Institute for Communications Technology (IfN)
Braunschweig, Germany
elabed@tu-bs.de

Abstract—Standard databases play an essential role in evaluating and comparing the results obtained by different research groups. In this paper, an Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) is introduced. This database can be used for research on open-vocabulary recognition of Arabic handwritten text, word segmentation, and writer identification. The AHTID/MW contains 3710 text lines and 22896 words written by 53 native writers of Arabic. In addition, ground truth annotation is provided for each text image. The database is freely available to researchers worldwide.

Keywords-Arabic handwritten text image; AHTID/MW database; open vocabulary; ground truth

I. INTRODUCTION

Over the last two decades, most efforts in text recognition have focused on Latin script. This is due to the availability of several databases of printed and handwritten Latin text [1, 2]. Similar databases also exist for a few other scripts, such as Chinese and Indian scripts [3, 4, 5]. Research on recognizing Arabic text is still limited, and the lack of freely available Arabic databases is considered one of the reasons why it lags behind research on other languages. Each research group has evaluated its system on a data set it gathered itself, and different recognition rates have been reported, so comparing systems is rather difficult. In the field of Arabic handwritten text recognition, a standard database is therefore crucial for both text image recognition and writer identification.
Researchers have prepared some databases for handwritten texts [6], handwritten words [7], and bank checks [8]. To the best of our knowledge, only one database is available for Arabic handwritten text lines with open vocabulary [9]. However, it does not include a dataset of words, and it appears to be difficult to access.

Considering these issues, we propose an Arabic Handwritten Text Images Database (AHTID/MW) that covers all Arabic characters and their forms (beginning, middle, end, and isolated). The proposed database contains Arabic words and text lines written by 53 different writers. Two types of ground truth based on content information are generated, one for text-line images and one for word images. This database will be made available to the scientific community and may be used as a benchmark on which researchers can evaluate their algorithms and compare their results with other published work.

In Section II, we outline the published work related to developing off-line Arabic handwritten databases. In Section III, we briefly illustrate the characteristics of the Arabic script. Section IV describes the proposed AHTID/MW database. In Section V, the database structure is presented along with the ground truth data. Finally, concluding remarks are given in Section VI.

II. CURRENT ARABIC HANDWRITTEN DATABASES

Recent years have shown increasing interest in Arabic handwritten text recognition. Having started with small private data sets to evaluate their systems, more and more researchers are now paying attention to dataset standardization. For example, a database of unconstrained isolated Arabic handwritten characters written by 48 writers was used by Khedher et al. [10] for isolated character recognition research. The IFHCDB (Isolated Farsi Handwritten Character Database) was presented by Mozaffari et al. [11] for use in optical character recognition research.
It consists of 52,380 grayscale images of handwritten characters and 17,740 numerals. The Al-ISRA database [12] was collected at Al-Isra University in Amman, Jordan. It contains 37,000 Arabic words, 10,000 digits, 2,500 signatures, and 500 free-form Arabic sentences gathered from five hundred students. In 2002, Alma’adeed et al. presented the AHDB database [6]. It includes images of words used to describe numbers, images of the most frequent words in Arabic writing, and images of sentences used to write legal amounts on Arabic checks. Another Arabic database, for research on the recognition of Arabic handwritten checks (CENPARMI), was developed by Al-Ohali et al. [8]. This database can be used mainly for Arabic handwritten number, digit, and limited-vocabulary word recognition.

2012 International Conference on Frontiers in Handwriting Recognition. 978-0-7695-4774-9/12 $26.00 © 2012 IEEE. DOI 10.1109/ICFHR.2012.155

A database including Arabic dates, isolated digits, numerical strings,