International Journal of Engineering Research and Development
e-ISSN: 2278-067X, p-ISSN: 2278-800X, www.ijerd.com
Volume 3, Issue 2 (August 2012), PP. 47-53

Document Image Segmentation for Analyzing of Data in Raster Image

Dr. P. Sengottuvelan 1, Mr. R. Arulmurugan 2, Mr. R. Lokeshkumar 3
1,2,3 Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam, India

Abstract––This paper addresses the need for an automated Digital Library Management system. The aim is to automate the analysis of data contained in raster image documents for intelligent information retrieval in a digital library. An efficient and computationally fast method for segmenting the text and graphics parts of document images, based on multi-scale wavelet analysis and statistical pattern recognition, is presented. The extracted text is further classified into Title, Author name, name of the publication, etc., and stored in the database for further library-related operations. Our segmentation scheme does not assume any a priori information regarding the font size, scanning resolution, type of layout, etc. of the document.

Keywords––Document segmentation, Daubechies wavelet, multiscale wavelet analysis, a priori information, Fourier transform.

I. INTRODUCTION

In today's world, automated processing and reading of documents has become an imperative need. Efforts have been made to store documents in digitized form, but this requires enormous storage space, even after compression using modern techniques. Documents can be represented more effectively by separating the text from the graphics/image part, storing the text as an ASCII (character) set and the graphics/image part as bitmaps. Document image segmentation therefore plays an important role, because it facilitates the efficient searching and storage of the text part of documents that large databases require.
Consequently, several researchers have attempted different techniques to segment the text and graphics parts of document images [1]. Several useful techniques for text-graphics segmentation have been reported, the most popular being the top-down and bottom-up approaches [2]-[4]. The most common top-down techniques are run-length smoothing and projection profiles. Top-down approaches first split the document into blocks, which are then identified and subdivided, first into columns and then into paragraphs, text lines, and possibly words [5]-[8]. Some assume these blocks to be only rectangular. Because they are restricted to rectangular blocks, the top-down methods are not suitable for skewed text. The bottom-up methods, in contrast, are typically variants of connected-component analysis: they iteratively group together components of the same type, starting from the pixel level, to form higher-level descriptions of the printed regions of the document (words, text lines, paragraphs, etc.). The drawback of the connected-components method is that it is sensitive to character size, scanning resolution, inter-line spacing, and inter-character spacing. A wavelet-based tool has also been designed for distinguishing text from non-text regions and for characterizing font sizes [8].

Some of the common difficulties that occur in documents are given below:
- Differences in font size, column layout, orientation, and other textual attributes.
- Skewed documents and text regions with different orientations.
- Documents degraded by improper scanning.
- Combinations of varying text and background gray levels.
- Text regions touching or overlapping non-text regions.
- Irregular layout structures with non-convex or overlapping object boundaries.
- Multicolumn documents with misaligned text lines and different languages.

Thus, to develop a full-fledged system, all of the above difficulties must be overcome.

II. PROPOSED SYSTEM

The proposed system uses document image segmentation to extract text such as the Title, Author name, and name of the publication from a scanned image (the front cover of a book). The extracted text is classified accordingly and stored in the database for further library-related operations. The proposed system thus avoids the need for manual entry of this information into the database.
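
The storage step described above can be illustrated with a minimal sketch. This section does not specify a database engine or schema, so everything here is an assumption made for illustration: SQLite is used through Python's standard sqlite3 module, and the table name books, the column names, and the helper functions open_catalogue, store_record, and find_by_title are hypothetical. The sketch assumes the segmentation and classification stage has already produced the labeled fields; it does not perform segmentation itself.

```python
import sqlite3

# Hypothetical schema: the paper names the fields (Title, Author name,
# name of the publication) but does not specify a database layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS books (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    title       TEXT NOT NULL,
    author      TEXT,
    publication TEXT
)
"""


def open_catalogue(path=":memory:"):
    """Open (or create) the catalogue database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_record(conn, fields):
    """Insert one classified cover record; `fields` maps field name to text."""
    cur = conn.execute(
        "INSERT INTO books (title, author, publication) VALUES (?, ?, ?)",
        (fields["title"], fields.get("author"), fields.get("publication")),
    )
    conn.commit()
    return cur.lastrowid


def find_by_title(conn, keyword):
    """Return (title, author, publication) rows whose title contains `keyword`."""
    cur = conn.execute(
        "SELECT title, author, publication FROM books "
        "WHERE title LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = open_catalogue()
    # Placeholder values standing in for the output of the
    # segmentation and classification stage.
    store_record(
        conn,
        {"title": "Example Title", "author": "A. Author", "publication": "Example Press"},
    )
    print(find_by_title(conn, "Example"))
```

Keeping each classified field in its own column lets library operations such as title or author search run as ordinary queries, which is the practical benefit of replacing manual data entry with extracted text.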