Research Article
Offline Pashto Characters Dataset for OCR Systems
Sulaiman Khan (1,2), Habib Ullah Khan (2), and Shah Nazir (1)
(1) Department of Computer Science, University of Swabi, Swabi, Pakistan
(2) Department of Accounting and Information Systems, College of Business and Economics, Qatar University, Doha, Qatar
Correspondence should be addressed to Habib Ullah Khan; habib.khan@qu.edu.qa
Received 19 May 2021; Accepted 17 July 2021; Published 27 July 2021
Academic Editor: Marimuthu Karuppiah
Copyright © 2021 Sulaiman Khan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In computer vision and artificial intelligence, text recognition and analysis based on images play a key role in the text retrieval
process. Enabling a machine learning technique to recognize handwritten characters of a specific language requires a standard
dataset. Acceptable handwritten character datasets are available in many languages including English, Arabic, and many more.
However, the lack of datasets for handwritten Pashto characters hinders the application of suitable machine learning algorithms
for extracting useful insights. To address this issue, this study presents the first handwritten Pashto characters image
dataset (HPCID) for scientific research. This dataset consists of 14,784 samples, that is, 336 samples for each of the
44 characters in the Pashto character set. These samples of handwritten characters were collected on A4-sized paper
from different students of the Pashto Department at the University of Peshawar, Khyber Pakhtunkhwa, Pakistan.
In total, 336 students and faculty members contributed during the data accumulation phase of the proposed database.
This dataset contains multisize, multifont, and multistyle characters of varying structures.
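The dataset size stated in the abstract follows directly from the number of characters and the number of samples per character. The short Python sketch below only reproduces that arithmetic; the constant and function names are illustrative assumptions and are not part of the published dataset or of any accompanying tooling.

```python
# Illustrative check of the dataset size reported in the abstract.
# The names below are hypothetical and chosen only for this sketch.
NUM_CHARACTERS = 44           # characters in the Pashto character set
SAMPLES_PER_CHARACTER = 336   # one sample per contributor for each character


def total_samples(num_characters: int, samples_per_character: int) -> int:
    """Return the total number of images in a class-balanced dataset."""
    return num_characters * samples_per_character


total = total_samples(NUM_CHARACTERS, SAMPLES_PER_CHARACTER)
print(total)  # 14784
```

Because every character has the same number of samples, the dataset is balanced across its 44 classes under this reading of the abstract.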
1. Introduction
Significant work has been reported in the last decade on
the automatic recognition of text or characters in images
using different classification and recognition techniques.
In any OCR system, a systematic database must first be
created by collecting and scanning written documents,
which is easier nowadays due to the availability of
high-definition cameras, mobile phones, and similar
devices. As a result, datasets have been developed for
many languages, including Arabic, Urdu, English, Japanese,
Chinese, and many others, but not for Pashto. Pashto is
the state language of Afghanistan and a major language in
the northern areas of the Khyber Pakhtunkhwa province of
Pakistan. According to the census of 2009, approximately
60 million people around the globe can speak or
understand this language.
Urdu, Persian, and Arabic are considered sister languages
of Pashto, as Pashto has either borrowed or modified the
characters of these languages to form an indigenous
character set of 44 characters. Multiple character
databases, in both machine-printed and handwritten
formats, have been developed for those languages but not
for Pashto. Bouressace and Csirik developed a dataset
named printed Arabic characters for OCR systems [1].
This dataset was developed by slicing text from ten
different Arabic newspapers and comprises 2954 varying
character images. It contains letters that vary in font
and style, with blurring effects and varying resolutions.
Abdalkafor surveyed different datasets developed for the
Arabic language [2]. Some researchers like
Khan et al. [3–5] and Ma’adeed et al. [6] have developed
datasets for their research, but these datasets are limited to
their own work and are not available online. Pechwitz et al. [7]
developed a database for the Arabic language known as the
IFN/ENIT database. This dataset contains more than 2200
binary images collected from 411 different scholars. It
comprises 26000 isolated word images that contain the
names of Tunisian cities and villages.
This research work presents the first handwritten Pashto
characters database for the research community in the
computer science field. This dataset consists of 14784 binary
images of carved samples with labels for the automatic
Hindawi
Security and Communication Networks
Volume 2021, Article ID 3543816, 7 pages
https://doi.org/10.1155/2021/3543816