Automated Detection and Segmentation of Table of Contents Page from
Document Images
S. Mandal, S. P. Chowdhury, A. K. Das
CST Department B. E. College (D.U.)
Sibpur, Howrah 711 103
{sekhar, shyama, amit}@becs.ac.in
Bhabatosh Chanda
ECS Unit, Indian Statistical Institute
Calcutta 700 035
chanda@isical.ac.in
Abstract
To extract structural information from the table of contents
(TOC) for building a digital document library, the TOC page
must first be identified and segmented. The objective of a
digital document library is to provide a non-labour-intensive,
cheap and flexible way of storing, representing and managing
paper documents in electronic form to facilitate indexing,
viewing, printing and extracting the intended portions.
Information extracted from the TOC pages can be used in the
document database for effective retrieval of the required
pages. In this paper we present a fully automatic method for
identification and segmentation of the table of contents (TOC)
page from a scanned document.
Keywords: Document image segmentation, Table of
contents detection, Digital document library.
1. Introduction
Table of contents (TOC) detection from scanned document
pages is important for a user of the digital document library,
since the TOC serves as an index to the contents of books,
journals, reports etc. It is also necessary for the document
database in the library to keep structural information, such as
chapters, sections and subsections, for easy retrieval of the
intended portions as demanded by the user. As a result, the
identification/segmentation of the TOC from scanned pages
has attracted researchers [16, 2], who have put forward a
couple of schemes for the task.
It has been observed from the existing literature that most
of the work is directed toward higher-level understanding of
the TOC page, so as to extract the structural information and
represent the whole in a suitable meta-structure such as HTML
or XML [16]. In doing so, these works assume either that the
TOC is already segmented or that some sort of character/symbol
recognition technique is applied to identify the TOC pages and
their underlying structures. Although symbol recognition is a
part of OCR activity, when it is applied to non-segmented
mixed material (text with math-zones and others) the
computation is expensive and the success rate far from
satisfactory.
We, on the other hand, contend that a better approach is to
identify the TOCs from the mixed material first, thereby
helping the subsequent image processing and OCR activities to
focus their processing only on the respective zones. In this
paper we propose a fully automated technique for identification
of a TOC page, or the portion of a TOC within a text page,
exploiting a priori knowledge of the underlying structure of
the possible types of TOCs in books, journals etc. It may be
noted that we do not use any type of symbol recognition
technique for identification/segmentation of TOCs.
Our goal is to identify whether a scanned page or a part of it
is a TOC, using a top-down approach that starts with an
expectation of encountering a couple of structures found in
common TOC forms. After identification we segment the TOC
into its constituent parts, namely the number, title and
corresponding page number of each section and subsection, to
facilitate OCR and to help search and browse the document
database. We assume that the input is the text portion, which
has already been segmented from the mixed objects in a
document page such as text, graphics and half-tones [4].
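As an illustration only, the result of segmenting one TOC line
can be pictured as three image regions (number, title, page
number) that are later handed to OCR. The sketch below is a
minimal Python example under stated assumptions: the names
Box, TocEntry, union and naive_split are hypothetical, the
(left, top, right, bottom) box convention is assumed, and the
positional split is a naive stand-in, not the method proposed
in this paper. No character recognition is involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed convention: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class TocEntry:
    """Image regions of one TOC line (hypothetical record layout)."""
    number: Optional[Box]  # region holding the section/subsection number
    title: Box             # region holding the title
    page: Optional[Box]    # region holding the page number


def union(boxes: List[Box]) -> Box:
    """Return the smallest box enclosing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def naive_split(words: List[Box]) -> TocEntry:
    """Naive positional split of one TOC line (NOT the paper's method):
    leftmost word -> number, rightmost word -> page number,
    everything in between -> title."""
    if not words:
        raise ValueError("a TOC line needs at least one word box")
    words = sorted(words)  # tuples sort by left edge first
    if len(words) < 3:
        # Too few words to separate the fields; treat the line as title only.
        return TocEntry(None, union(words), None)
    return TocEntry(words[0], union(words[1:-1]), words[-1])
```

Keeping regions rather than recognised text matches the
assumption above that OCR runs only after the TOC has been
segmented.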
1.1. Past works
There are a number of schemes for page layout analysis and
segmentation [10, 9, 4, 1, 3, 8]. Most of these works are
directed towards segmentation of text, graphics and half-tones.
The methods in [12, 10, 9] do not go further to extract tables
and other structures from the document. In the CyberMagazine
system, Takasu et al. [15] proposed segmentation of blocks and
syntactic analysis of their contents. Article recognition is
done using a decision tree classifier and a matrix grammar
based syntactic analysis. In [11, 14] O’Gorman and Story
proposed a method for TOC structure extraction in their
Right Pages Electronic Library Systems
Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)
0-7695-1960-1/03 $17.00 © 2003 IEEE