International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 05 | May-2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 3968
Optical Character Recognition for Hindi
Prasanta Pratim Bairagi
Assistant Professor, Department of CSE, Assam down town University, Assam, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Optical Character Recognition (OCR) is a system that
converts images of handwritten or printed text into
machine-editable form. Devanagari script is
used by many Indian languages such as Hindi, Nepali, Marathi,
and Sindhi. This script forms the foundation of Hindi,
an official language of India and the most widely spoken language
in the country. At present, there is a large demand for storing
the information available in paper documents in digital format
and later reusing this information through search.
In this paper we propose a new method for recognition of
printed Hindi characters in Devanagari script. In this project,
different operations such as pre-processing, segmentation,
feature extraction, and classification have been studied and
implemented in order to design a sophisticated OCR system for
Hindi based on Devanagari script. During this research,
related research papers on existing OCR systems were also
studied. The main emphasis of this project is on
the recognition of individual consonants and
vowels, which can later be extended to recognize complex
derived letters and words.
Key Words: Optical Character Recognition, Feature
Extraction, Segmentation, Hindi Character, Devanagari
Script
1. INTRODUCTION
The introduction is divided into two parts.
The first part describes OCR, its types, and its uses, and
the second part describes Devanagari script, the
foundation of the Hindi language.
1.1 About OCR
Optical Character Recognition has emerged as a major
research area since 1950. Optical Character Recognition is
the mechanical or electronic translation of images of
handwritten or printed text into machine-editable text [1].
The images are usually captured by a scanner. Throughout
this paper, however, OCR refers to the recognition of printed
text. Data entry through OCR is relatively fast, more
accurate, and generally more efficient than ordinary keyboard
entry. An OCR system enables us to store a book or a
magazine article directly in digital form and also makes it
editable. Development of OCR for Indian scripts is an active
area of research, and designing such a system is challenging
because of the large number of letters in the alphabet, the
sophisticated ways in which they combine, and the
complicated graphemes that result. In Devanagari
script, there is usually no separation between the characters
written in a text. In this research work, pre-processing
operations such as conversion of grayscale images to binary
images, image rectification, and segmentation are considered
in order to design the system.
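The conversion of a grayscale image to a binary image mentioned above can be illustrated with a minimal sketch. The paper does not state which thresholding rule it uses, so Otsu's global threshold is assumed here purely for illustration; the names `otsu_threshold` and `binarize`, and the convention that dark ink maps to 1, are hypothetical choices and not the author's implementation.

```python
from typing import List

# A grayscale image as rows of pixel intensities in the range 0..255.
GrayImage = List[List[int]]


def otsu_threshold(image: GrayImage) -> int:
    """Return the threshold that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in image:
        for px in row:
            hist[px] += 1

    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with intensity <= t
    sum_bg = 0  # sum of intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # all pixels are in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image: GrayImage) -> List[List[int]]:
    """Map pixels at or below the Otsu threshold to 1 (ink) and the rest to 0."""
    t = otsu_threshold(image)
    return [[1 if px <= t else 0 for px in row] for row in image]


if __name__ == "__main__":
    page = [[10, 200], [20, 220]]  # toy 2x2 image: dark left column, light right column
    print(otsu_threshold(page))
    print(binarize(page))
```

A global threshold is the simplest choice; scanned pages with uneven illumination may need a locally adaptive threshold instead.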
1.2 Types of OCR
OCR systems deal with three types of text input. They are
briefly discussed below:
Offline Handwritten Text
Text that a person writes on paper with a pen or
pencil, and that is then scanned to digitize it,
is called offline handwritten text.
Online Handwritten Text
Online handwritten text is written directly on a
digital platform using a digital device. The output is a
sequence of x-y coordinates that express pen position as well
as other information such as pressure and speed of writing.
Machine Printed Text
Machine printed text is commonly found in printed
documents and is produced by offset processes.
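The sequence of x-y coordinates described for online handwritten text can be pictured as a simple data structure. The sketch below is only illustrative: the `PenSample` fields, the timestamp `t`, and the helper `average_speed` are assumptions introduced here, and writing speed is derived from timestamps rather than taken from any particular device.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class PenSample:
    """One sampled pen position (hypothetical layout)."""
    x: float
    y: float
    pressure: float  # normalised pen pressure, assumed to lie in 0..1
    t: float         # timestamp in seconds


def average_speed(stroke: List[PenSample]) -> float:
    """Path length of the stroke divided by elapsed time; 0.0 if undefined."""
    if len(stroke) < 2:
        return 0.0
    length = sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(stroke, stroke[1:]))
    elapsed = stroke[-1].t - stroke[0].t
    return length / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    stroke = [
        PenSample(0.0, 0.0, 0.5, 0.0),
        PenSample(3.0, 4.0, 0.6, 0.5),
        PenSample(6.0, 8.0, 0.4, 1.0),
    ]
    print(average_speed(stroke))
```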
1.3 Uses of OCR
Optical Character Recognition is used to process different
types of scanned documents, such as PDF files or images, and
convert them into editable files.
The OCR system is used for the following purposes:
Processing bank cheques.
Documenting library materials in digital format.
Storing documents in digital form, searching text, and extracting data.
1.4 About Devanagari Script
Devanagari script is the foundation of many Indian
languages such as Hindi, Nepali, Marathi, and Sindhi, and is used by
more than 300 million people around the world. Devanagari
script therefore plays a major role in the development
of literature and manuscripts. A large body of literature exists
in old manuscripts, Vedas, and scriptures, and because
these are so old they are not easily accessible to
everyone. The need to read these old scriptures
led to their digital conversion by scanning the books.
For scanning and converting the documents into editable