International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 09 | Sep 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 1560

A Text Extraction Approach towards Document Image Organisation for the Android Platform

M. Madhuram 1, Aruna Parameswaran 2

1 Assistant Professor, Department of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, India
2 Student, Department of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, India

Abstract – This paper demonstrates the implementation of a comprehensive gallery application based on the Android platform to handle document images. The fundamental objective of this app is to provide an automated interface for performing document image processing, text extraction, and organisation of relevant information. Over time, the widespread use of smartphones has resulted in the exponential growth of data. Despite the recent advancements in computing, storage techniques, and image processing, there is still a demand for highly optimised information retrieval methods, particularly for smartphones. Manual organisation of document images is an extremely time-consuming and tedious task. The proposed system presents an alternative approach by using Optical Character Recognition (OCR). The app employs the Firebase platform for detecting and extracting text. It also incorporates a Keyword Search feature, based on the KMP Algorithm, hence facilitating faster access to information. Additionally, this paper proposes a simple method for content-based categorisation of document images using a Bag-of-Words model.
Key Words: Android Application, Document Image Analysis, Firebase, Image Processing, Optical Character Recognition software (OCR), Text Extraction

1. INTRODUCTION

In recent years, there has been a massive influx of data, particularly in digitised form. As of 2017, it was reported that 90% of the world’s data had been created in the preceding two years, at a rate of 2.5 quintillion bytes a day. Consequently, data storage and information retrieval techniques have become more significant. Traditionally, data was stored physically on paper. However, recent advancements in computing, storage techniques (such as Big Data), and image processing have resulted in a shift towards digitised content.

Document images are currently the subject of focused research. They differ from regular images in their structural and spatial properties. Hence, in order to process them effectively, their text contents must be utilised. The term 'Document Image Analysis' (DIA) refers to the set of techniques which perform digital conversion, text or image detection, and meaningful organisation of information. One of the most commonly used techniques in DIA is Optical Character Recognition (OCR). OCR converts the printed or handwritten text present in a scanned document, image or scene photo to an appropriate electronic format.

In one way or another, every individual handles document images on a daily basis. Commonly used platforms for sharing them include messaging apps, e-mail and cloud storage services (Google Drive, Dropbox etc.). A common example is the scanned class notes which are distributed among students and teachers. Other common examples include bills, receipts and even identity documents (such as a passport or PAN card).
Improvements in the standard of smartphone cameras, higher on-board storage, and the widespread availability of mobile scanner apps have collectively made information exchange easier. The problem arises, however, when there is no single method to organise large volumes of information. Eventually, this results in data being misplaced or lost. As there may be multiple sources for these images, an automated system is required that syncs dynamically with the device storage and manages the information. The proposed work addresses these issues by implementing features such as Keyword Search and Content-based Categorisation of Images.

The rest of this paper is organised as follows: Section 2 presents the literature survey. Section 3 introduces the proposed system and elaborates on its architecture and the purpose of each individual module. Section 4 explains the algorithms used in the implementation of the system, and the results are expanded upon in Section 5. Section 6 summarises the observations along with concluding remarks and addresses the future scope of research.

2. LITERATURE REVIEW

Through the years, extensive research work has been done on Optical Character Recognition (OCR) and text extraction techniques. The increase in the volume of digitised information has further intensified these efforts. In this section, we review some of these systems and discuss them in detail.