International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 09 | Sep 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 1560
A Text Extraction Approach towards Document Image Organisation for
the Android Platform
M. Madhuram¹, Aruna Parameswaran²
¹Assistant Professor, Department of Computer Science and Engineering, SRM Institute of Science and Technology,
Chennai, India
²Student, Department of Computer Science and Engineering, SRM Institute of Science and Technology,
Chennai, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – This paper demonstrates the implementation of
a comprehensive gallery application based on the Android
platform to handle document images. The fundamental
objective of this app is to provide an automated interface for
performing document image processing, text extraction, and
organisation of relevant information. Over time, the
widespread use of smartphones has resulted in the exponential
growth of data. Despite the recent advancements in
computing, storage techniques, and image processing, there is
still a demand for highly optimised information retrieval
methods, particularly for smartphones. Manual organisation
of document images is an extremely time-consuming and
tedious task. The proposed system presents an alternative
approach by using Optical Character Recognition (OCR). The
app employs the Firebase platform for detecting and
extracting text. It also incorporates a Keyword Search feature
based on the Knuth-Morris-Pratt (KMP) string-matching algorithm,
which gives faster access to information. Additionally, this
paper proposes a simple
method for content-based categorisation of document images
using a Bag-of-Words model.
Key Words: Android Application, Document Image Analysis,
Firebase, Image Processing, Optical Character Recognition
(OCR), Text Extraction
1. INTRODUCTION
In recent years, there has been a massive influx of
data, particularly in digitised form. As of 2017, it was
reported that 90% of the world’s data had been created in
the span of just two years, at a rate of 2.5 quintillion
bytes a day. Consequently, data storage and information
retrieval techniques have become more significant.
Traditionally, data was stored physically on paper.
However, recent advancements in computing, storage
techniques (such as those developed for Big Data), and image
processing have driven a shift towards digitised content.
Document images are currently the subject of focused
research. They differ from regular images in their structural
and spatial properties. Hence, processing them effectively
requires making use of their text content. The term
'Document Image Analysis' (DIA) refers to the set of
techniques which perform digital conversion, text or image
detection, and meaningful organisation of information. One
of the most commonly used techniques in DIA is Optical
Character Recognition (OCR). OCR converts the printed or
handwritten text present in a scanned document, image or
scene photo into an appropriate electronic format.
In one way or another, most smartphone users handle
document images on a daily basis. These images are commonly
exchanged through messaging apps, e-mail and cloud
storage services (Google Drive, Dropbox etc.). A common
example is scanned class notes shared among students and
teachers. Other common examples include bills, receipts and
even identity documents (such as a passport or PAN card).
Improvements in the standard of smartphone cameras, higher
on-board storage, and the widespread availability of mobile
scanner apps have collectively made information exchange
easier.
The problem arises, however, when there is no single
method to organise large volumes of information. Eventually,
this results in data being misplaced or lost. As there may be
multiple sources for these images, an automated system is
required that syncs dynamically with storage and then manages
the information. The proposed work addresses
these issues by implementing features such as Keyword
Search and Content-based Categorisation of Images.
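As an illustration of the Keyword Search feature, which the abstract bases on the KMP algorithm, the following is a minimal Python sketch of Knuth-Morris-Pratt matching over text that has already been extracted by OCR. It is not the app's implementation: the function names, the dictionary mapping image names to extracted text, and the case-insensitive comparison are all assumptions made for this example.

```python
from typing import Dict, List


def build_failure_table(pattern: str) -> List[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find_all(text: str, pattern: str) -> List[int]:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []  # assumption: an empty keyword matches nothing
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse earlier comparisons, no backtracking in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # continue, allowing overlapping matches
    return matches


def search_documents(extracted_text: Dict[str, str], keyword: str) -> List[str]:
    """Hypothetical helper: names of images whose OCR text contains keyword."""
    needle = keyword.lower()
    return [
        name
        for name, text in extracted_text.items()
        if kmp_find_all(text.lower(), needle)
    ]
```

Because the failure table lets the search continue without re-reading earlier characters of the text, each document is scanned in time linear in its length plus the keyword length, which suits repeated searches over many stored images.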
The rest of this paper is organised as follows: Section 2
presents an overview of the related literature. Section
3 introduces the proposed system and elaborates on its
architecture and the purpose of each
individual module. Section 4 explains the algorithms used in
the implementation of the system, and Section 5 presents the
results. Section 6 summarises the
observations along with concluding remarks and
addresses the future scope of research.
2. LITERATURE REVIEW
Over the years, extensive research has been
done on Optical Character Recognition (OCR) and text
extraction techniques. The increase in the volume of digitised
information has further intensified these efforts. In this
section, we review some of these systems and discuss
them in detail.