Automated Detection and Segmentation of Table of Contents Page from Document Images

S. Mandal, S. P. Chowdhury, A. K. Das
CST Department, B. E. College (D.U.), Sibpur, Howrah 711 103
{sekhar, shyama, amit}@becs.ac.in

Bhabatosh Chanda
ECS Unit, Indian Statistical Institute, Calcutta 700 035
chanda@isical.ac.in

Abstract

Extracting structural information from the table of contents (TOC) helps in building a digital document library, and this first requires identifying and segmenting the TOC page. The objective of a digital document library is to provide a non-labour-intensive, cheap and flexible way of storing, representing and managing paper documents in electronic form, so as to facilitate indexing, viewing, printing and extracting the intended portions. Information from the TOC pages can be extracted for use in a document database for effective retrieval of the required pages. In this paper we present fully automatic identification and segmentation of the table of contents (TOC) page from a scanned document.

Keywords: Document image segmentation, Table of contents detection, Digital document library.

1. Introduction

Table of contents (TOC) detection from scanned document pages is important for a user of the digital document library, since the TOC serves as an index to the contents of books, journals, reports, etc. It is also necessary for the document database in the library to keep structural information, such as chapters, sections and subsections, for easy retrieval of the intended portions as demanded by the user. As a result, the identification/segmentation of the TOC from scanned pages has attracted researchers [16, 2], who have put forward a couple of schemes for the task.

It has been observed from the existing literature that most of the works are directed toward higher-level understanding of the TOC page, so as to extract the structural information and represent the whole in a suitable meta-structure such as HTML or XML [16].
In doing so, they assume either that the TOC is already segmented or that some sort of character/symbol recognition technique is applied to identify the TOC pages and their underlying structures. Although symbol recognition is part of OCR, applying it to non-segmented mixed material (text with math zones and other objects) is computationally expensive, and its success is far from satisfactory. We, on the other hand, contend that a better approach is to identify the TOCs in the mixed material first, so that subsequent image processing and OCR can focus only on the respective zones.

In this paper we propose a fully automated technique for identifying a TOC page, or the portion of a text page that is a TOC, by exploiting a priori knowledge of the underlying structure of the possible types of TOCs in books, journals, etc. It may be noted that we do not use any symbol recognition technique for identification/segmentation of TOCs.

Our goal is to decide whether a scanned page, or a part of it, is a TOC or not, using a top-down approach that starts with an expectation of encountering a few structures found in common TOC forms. After identification, we segment the TOC into its constituent parts, namely the number, title and corresponding page number of each section and subsection, to facilitate OCR and to help in searching and browsing the document database. We assume that the input is the text portion, which has already been segmented from the mixed objects in a document page such as text, graphics and half-tones [4].

1.1. Past works

There are a number of schemes for page layout analysis and segmentation [10, 9, 4, 1, 3, 8]. Most of the works are directed towards segmentation of text, graphics and half-tones. The works in [12, 10, 9] did not go further to extract tables and other structures from the document. In the system CyberMagazine, Takasu et al. [15] proposed segmentation of blocks and syntactic analysis of their contents.
Article recognition is done using a decision tree classifier and a matrix-grammar-based syntactic analysis. In [11, 14] O’Gorman and Story proposed a method for TOC structure extraction in their Right Pages Electronic Library Systems

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 © 2003 IEEE