Evaluating SEE - A Benchmarking System for Document Page Segmentation
Stefan Agne, Andreas Dengel, and Bertin Klein
German Research Center for Artificial Intelligence (DFKI GmbH)
P.O. Box 2080, D-67608 Kaiserslautern, Germany
e-mail: {stefan.agne, andreas.dengel, bertin.klein}@dfki.de
Abstract
The decomposition of a document into segments such as
text regions and graphics is a significant part of the docu-
ment analysis process. The basic requirement for rating and
improvement of page segmentation algorithms is systematic
evaluation. The approaches known from the literature have
the disadvantage that manually generated reference data
(zoning ground truth) are needed for the evaluation task.
The effort and cost of the creation of these data are very
high.
This paper describes the evaluation system SEE and
presents an assessment of its quality. The system requires
the OCR generated text and the original text of the doc-
ument in correct reading order (text ground truth) as in-
put. No manually generated zoning ground truth is needed.
The implicit structure information that is contained in the
text ground truth is used for the evaluation of the automatic
zoning. To this end, the system searches for matches, i.e. an
assignment between corresponding text regions in the text
ground truth and in the OCR generated text. The matching
method is based on a fault tolerant string matching algorithm
and can therefore tolerate OCR errors in the text. The
segmentation errors are determined by evaluating the matching.
Subsequently,
the edit operations which are necessary for the correction
of the recognized segmentation errors are computed to
estimate the correction costs. Furthermore, SEE provides a
version of the OCR generated text in which the detected
page segmentation errors have been corrected.
1 Introduction
In the domain of document analysis, document page seg-
mentation is a very significant field of research. The task is
to divide documents into separate components such as text
regions and graphics. For this purpose, several approaches
have been developed.
For development and improvement, as well as for the
selection of segmentation algorithms, it is important to
evaluate these algorithms objectively, especially in comparison
to each other. This process is called benchmarking.
Figure 1. Benchmarking in the field of document analysis
(diagram labels: documents, automatic zoning, results of
segmentation, zoning ground truth, comparison, measures)
The basic principles of benchmarking in the field of doc-
ument analysis are shown in Figure 1.
In the first step, the zoning ground truth is produced
manually for each document. This zoning ground truth is
considered the correct decomposition of the document into
regions. For instance, a region can be specified by a polygon.
During the process of automatic zoning the document is
divided automatically into regions. The result of the au-
tomatic zoning is then compared with the corresponding
zoning ground truth in order to evaluate the quality of the
decomposition. Based on this comparison, evaluation
measures are computed.
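The comparison step of this conventional benchmarking scheme can be illustrated with a minimal sketch. It is not part of SEE, which works without zoning ground truth. The sketch assumes that regions are axis-aligned rectangles rather than general polygons, and it uses intersection over union with a threshold of 0.8 as one possible evaluation measure. The names Box, overlap, and matched_fraction, as well as the threshold, are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def area(b: Box) -> int:
    """Area of a box; degenerate or inverted boxes count as empty."""
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def overlap(a: Box, b: Box) -> float:
    """Intersection over union: 0.0 for disjoint boxes, 1.0 for identical ones."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def matched_fraction(truth: List[Box], auto: List[Box],
                     threshold: float = 0.8) -> float:
    """Share of ground-truth regions that some automatic region matches.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not truth:
        return 1.0
    hits = sum(1 for t in truth
               if any(overlap(t, z) >= threshold for z in auto))
    return hits / len(truth)
```

Under these assumptions, a ground-truth region that the automatic zoning splits into two equal halves has an overlap of 0.5 with each half and is therefore counted as unmatched.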
The objective of the paper is to present an evaluation
Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)
0-7695-1960-1/03 $17.00 © 2003 IEEE