Research Article

Offline Pashto Characters Dataset for OCR Systems

Sulaiman Khan,1,2 Habib Ullah Khan,2 and Shah Nazir1

1 Department of Computer Science, University of Swabi, Swabi, Pakistan
2 Department of Accounting and Information Systems, College of Business and Economics, Qatar University, Doha, Qatar

Correspondence should be addressed to Habib Ullah Khan; habib.khan@qu.edu.qa

Received 19 May 2021; Accepted 17 July 2021; Published 27 July 2021

Academic Editor: Marimuthu Karuppiah

Copyright © 2021 Sulaiman Khan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In computer vision and artificial intelligence, image-based text recognition and analysis play a key role in text retrieval. Enabling a machine learning technique to recognize the handwritten characters of a specific language requires a standard dataset. Acceptable handwritten character datasets are available for many languages, including English and Arabic. However, the lack of datasets for handwritten Pashto characters hinders the application of suitable machine learning algorithms for extracting useful insights. To address this issue, this study presents the first handwritten Pashto characters image dataset (HPCID) for scientific research. This dataset consists of 14784 samples, 336 for each of the 44 characters of the Pashto character set. The handwritten samples were collected on A4-sized paper from students of the Pashto Department at the University of Peshawar, Khyber Pakhtunkhwa, Pakistan. In total, 336 students and faculty members contributed during the data collection phase of the proposed database. The dataset contains multisize, multifont, and multistyle characters of varying structures.

1. 
Introduction

Significant work on the automatic recognition of text or characters in images, using different classification and recognition techniques, has been reported in the last decade. For any OCR system, a systematic database must be created by collecting and scanning written documents, which is easier nowadays due to the availability of high-definition cameras, mobile phones, and similar devices. Despite this, although datasets have been developed for many languages, including Arabic, Urdu, English, Japanese, and Chinese, none has been developed for Pashto. Pashto is the state language of Afghanistan and a major language in the northern areas of Khyber Pakhtunkhwa province of Pakistan. According to a census conducted in 2009, approximately 60 million people around the globe can speak or understand this language.

Urdu, Persian, and Arabic are considered sister languages of Pashto, as Pashto has either borrowed or modified characters from these languages to form its own set of 44 characters. Multiple character databases, covering both machine-printed and handwritten formats, have been developed for those languages, but not for Pashto. Bouressace and Csirik developed a dataset of printed Arabic characters for OCR systems [1]. This dataset was built by slicing text from ten different Arabic newspapers and comprises 2954 character images. Its letters vary in font and style and include blurring effects and varying resolutions. Abdalkafor surveyed the datasets developed for the Arabic language [2]. Some researchers, such as Khan et al. [3–5] and Ma’adeed et al. [6], have developed datasets for their own research, but these datasets are limited to their work and are not available online. Pechwitz et al. [7] developed a database for the Arabic language known as the IFN/ENIT database.
This dataset contains more than 2200 binary images collected from 411 different scholars. It comprises 26000 isolated word images and contains the names of Tunisian cities and villages.

This research work presents the first handwritten Pashto characters database for the research community in the computer science field. This dataset consists of 14784 binary images of carved samples with labels for the automatic

Hindawi
Security and Communication Networks
Volume 2021, Article ID 3543816, 7 pages
https://doi.org/10.1155/2021/3543816