Ancient Document Analysis Based on Text Line Extraction
Florian Kleber and Robert Sablatnig
Pattern Recognition and Image Processing Group
Institute of Computer Aided Automation
Vienna University of Technology
1040 Vienna, Austria
{kleber,sab}@prip.tuwien.ac.at
Melanie Gau and Heinz Miklas
Institute for Slavonic Studies
University of Vienna
1010 Vienna, Austria
heinz.miklas@univie.ac.at
Abstract
In order to preserve our cultural heritage and to enable
automated document processing, libraries and national
archives have started digitizing historical documents.
In degraded manuscripts (e.g. damaged by mold, humidity,
or bad storage conditions) the text or parts of it can
disappear. The remaining parts of the text can be
segmented, and the ruling can be extrapolated using
a priori knowledge. Since the ruling defines the position
of the text within a page, it can be used for layout
analysis and as a basis for enhancing readability.
Furthermore, information about the scribe (hand) of the
manuscript and its spatiotemporal origin can be gained by
analyzing the ruling. This paper presents an algorithm
for ruling estimation of Glagolitic texts that is based on
text line extraction. It is suitable for degraded manuscripts
because it extrapolates the baselines using a priori
knowledge of the ruling. The algorithm was tested on 30
pages of the Missale Sinaiticum, and the evaluation was
based on visual criteria.
1 Introduction
Digitization permits lossless storage of the contents
of a document and worldwide exchange of and access to
documents. If entire libraries, or parts of the libraries
of national archives or museums, are digitized¹, manual
analysis or digital restoration of the documents will not
be feasible due to cost and time.
¹ http://www.michael-culture.org/, accessed March 2008
Algorithms developed for automated document analysis
systems distinguish between documents containing
handwritten text, documents containing printed text, and
documents containing both text and images. Methods for,
e.g., the layout analysis of handwritten historical
documents [1] and text line segmentation in handwritten
documents [11, 10, 19] have been published. In contrast to
the text line segmentation algorithms mentioned above, the
proposed approach attempts to fit the extracted line
information into an a priori ruling scheme in order to gain
information about the scribe (hand) of the manuscript and
its spatiotemporal² origin, and to support layout analysis.
The algorithm proposed in this paper is used to analyze
the ruling of the Missale Sinaiticum (Cod. Sin. slav. 5/N,
11th century) [13]. The manuscript contains handwritten
Glagolitic [13] text and was found in 1975 at St.
Catherine’s monastery on Mt. Sinai. The images of
the Missale Sinaiticum were digitized in the project The
Sinaitic Glagolitic Sacramentary (Euchologium) Fragments³.
The folios are degraded due to water damage (for possible
degradations of parchment see [3]). Therefore, the text
lines of the folios have to be segmented, and the extracted
ruling must be extrapolated using a priori knowledge of the
ruling scheme [13] (see Figure 3 (b)). The developed
algorithm has a preprocessing stage, which comprises skew
estimation, adaptive image binarization, and noise removal.
After this preprocessing step, the text components (words,
characters, etc.) are segmented and finally grouped to
extract the text lines.
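The idea of extrapolating the ruling from partially detected text lines can be illustrated with a minimal sketch. It is not the algorithm of Section 3 but a simplified stand-in, assuming a deskewed page whose ruling consists of horizontal, equally spaced lines. The function name `fit_ruling` and the parameters `approx_top`, `approx_spacing`, and `n_lines` are hypothetical; they stand for the a priori ruling scheme. Each detected baseline is assigned to the nearest line index of the scheme, an offset and a spacing are fitted by least squares, and all lines, including undetected ones, are then predicted.

```python
from typing import Dict, List, Sequence


def fit_ruling(baselines: Sequence[float], approx_top: float,
               approx_spacing: float, n_lines: int) -> List[float]:
    """Fit y_k = y0 + k * d to detected baselines; return all n_lines positions.

    All parameters are illustrative assumptions: approx_top and
    approx_spacing describe the a priori ruling scheme, n_lines is the
    known number of ruled lines per page.
    """
    if approx_spacing <= 0 or n_lines < 1:
        raise ValueError("approx_spacing must be positive and n_lines >= 1")

    # Assign each detected baseline to the nearest line index of the scheme.
    assigned: Dict[int, float] = {}
    for y in baselines:
        k = round((y - approx_top) / approx_spacing)
        if 0 <= k < n_lines:
            expected = approx_top + k * approx_spacing
            # If two detections map to one index, keep the closer one.
            if k not in assigned or abs(y - expected) < abs(assigned[k] - expected):
                assigned[k] = y
    if len(assigned) < 2:
        raise ValueError("need baselines on at least two distinct lines")

    # Ordinary least squares for the offset y0 and the line spacing d.
    ks = list(assigned)
    ys = [assigned[k] for k in ks]
    k_mean = sum(ks) / len(ks)
    y_mean = sum(ys) / len(ys)
    sxx = sum((k - k_mean) ** 2 for k in ks)
    sxy = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys))
    d = sxy / sxx
    y0 = y_mean - d * k_mean

    # Predict every line of the scheme, including the undetected ones.
    return [y0 + k * d for k in range(n_lines)]


# Lines 0, 1, 3 and 4 were detected; lines 2 and 5 are assumed lost.
detected = [100.0, 140.0, 220.0, 260.0]
ruling = fit_ruling(detected, approx_top=95.0, approx_spacing=41.0, n_lines=6)
```

In this sketch the rough scheme values only serve to index the detections; the fitted offset and spacing come from the detected baselines themselves. Curved or skewed rulings would need a richer line model than a single offset and spacing.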
This paper is organized as follows: Section 2 reviews
the state of the art of text line extraction methods.
Section 3 describes the proposed algorithm in detail,
Section 4 discusses the experimental results, and a
conclusion follows.
2 Related Work
Methods for text line extraction include techniques
based on projection profiles, the Hough transform
[11, 9], perceptual grouping/clustering [8], Adaptive
Local Connectivity Maps (ALCM) [19], and smearing [10].
Likforman-Sulem et al. [10] present an overview paper
² See e.g. Leroy for the Greek tradition [7].
³ The authors would like to thank the Austrian Science Fund for
funding the project under grant P19608-G12.
978-1-4244-2175-6/08/$25.00 ©2008 IEEE