Journal of Theoretical and Applied Information Technology
15th July 2018. Vol.96. No 13
© 2005 – ongoing JATIT & LLS
ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195
4191
A NOVEL HYBRID APPROACH FOR FEATURE
EXTRACTION IN MALAYALAM HANDWRITTEN
CHARACTER RECOGNITION
¹AJAY JAMES, ²SUJALA K, ³CHANDRAN SARAVANAN
¹Assistant Professor, Department of Computer Science and Engineering, Govt. Engineering College Thrissur, India
²Mtech Student, Department of Computer Science and Engineering, Govt. Engineering College Thrissur, India
³Associate Professor, Department of Computer Science and Engineering, National Institute of Technology West Bengal, India
E-mail: ¹ajayjames80@gmail.com, ²sujala58@gmail.com, ³dr.cs1973@gmail.com
ABSTRACT
Optical Character Recognition (OCR) is the process of extracting text from a scanned document. To develop
a digitally empowered society, information should be made available in digital form, and OCR software
assists in digitizing documents in different languages. Many researchers are working on the digitization of
documents, particularly on developing effective and error-free character recognition models. Malayalam
handwritten character recognition accuracy is still limited to around 90% because of the challenges posed by
the Malayalam character set. The coexistence of two scripts (old and new), the large character set, and the
prevalence of similarly shaped characters make Malayalam handwritten character recognition more difficult.
Feature extraction may vary from language to language depending on the characteristics of the language. By
observing the shape patterns in each language, different novel methods are developed to extract features and
to recognize characters. In this research, a novel hybrid approach is proposed which uses a combination of
statistical and structural features (SSF). Statistical features are derived from the statistical distribution
of pixels. Structural features are based on the topological and geometrical properties of the character. This
study shows that combining statistical and structural features gives higher accuracy in Malayalam character
recognition.
Keywords— Optical Character Recognition, Binarization, Feature Extraction, Classification, Machine
Recognition, Decision Tree.
1. INTRODUCTION
Optical character recognition has many
applications in day-to-day life. Nowadays,
digitization is a high priority, and machine
recognition of characters is part of it. Research
in this area helps to develop a digitally
empowered society. Many old documents in
written or printed form can be converted to
editable text by means of digitization. OCR is a
technology that enables us to convert different
types of documents, such as scanned paper
documents, into editable and searchable data. For
example, a written or printed document can be
digitized as an image or another file format and
sent to a partner via email or another electronic
medium. OCR makes it possible to scan the
document and store it in an editable form so that
its content can be modified. In order to extract
and repurpose data from scanned documents, OCR
software is required that singles out letters
on the image, puts them into words and then
words into sentences, thus enabling access to
and editing of the content of the original document.
Handwritten character extraction and
recognition can improve human-computer
interaction and better integrate computers into
human society. The handwritten text recognition