International Journal of Digital Library Systems, 2(2), 27-54, April-June 2011
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global
is prohibited.
A Unified Algorithm for Identification of Various Tabular Structures from Document Images
Sekhar Mandal, Bengal Engineering and Science University, Shibpur, India
Amit K. Das, Bengal Engineering and Science University, Shibpur, India
Partha Bhowmick, Indian Institute of Technology Kharagpur, India
Bhabatosh Chanda, Indian Statistical Institute, Kolkata, India
ABSTRACT
This paper presents a unified algorithm for segmentation and identification of various tabular structures from
document page images. Such tabular structures include conventional tables and displayed math-zones, as well
as Table of Contents (TOC) and Index pages. After analyzing the page composition, the algorithm initially
classifies the input set of document pages into tabular and non-tabular pages. A tabular page contains at least
one of the tabular structures, whereas a non-tabular page does not contain any. The approach is unified in
the sense that it is able to identify all tabular structures from a tabular page, which leads to a considerable
simplification of document image segmentation in a novel manner. Such unification also speeds up the
segmentation process, because the existing methodologies produce time-consuming solutions by
treating different tabular structures as separate physical entities. Distinguishing features of different kinds of
tabular structures are used in stages in order to ensure the simplicity and efficiency of the algorithm,
which are demonstrated by exhaustive experimental results.
Keywords: Document Image Segmentation, Index Page Detection, Math-Zone Detection, Table Detection,
Tabular Structures, TOC Detection
DOI: 10.4018/jdls.2011040103
INTRODUCTION
Billions of pages are to be scanned and analyzed
to create document image libraries targeted to
real-world applications. The task is daunting;
however, there is a pressing need for these
libraries, as we witness a spurt of activities in
recent times in industries as well as in academia.
Creation of a document image library involves
a chain of thorough and intense activities like
scanning, pre-processing, segmentation, layout
analysis, storage and retrieval, etc. Hence,
it is still constrained with the requirement