Ancient Document Analysis Based on Text Line Extraction

Florian Kleber and Robert Sablatnig
Pattern Recognition and Image Processing Group
Institute of Computer Aided Automation
Vienna University of Technology
1040 Vienna, Austria
{kleber,sab}@prip.tuwien.ac.at

Melanie Gau and Heinz Miklas
Institute for Slavonic Studies
University of Vienna
1010 Vienna, Austria
heinz.miklas@univie.ac.at

Abstract

To preserve our cultural heritage and to enable automated document processing, libraries and national archives have started digitizing historical documents. In degraded manuscripts (e.g. those damaged by mold, humidity, or bad storage conditions), the text or parts of it can disappear. The remaining parts of the text can be segmented, and the ruling can be extrapolated using a priori knowledge. Since the ruling defines the position of the text within a page, it can be used for layout analysis and as a basis for enhancing readability. Furthermore, information about the scribe (hand) of the manuscript and its spatiotemporal origin can be gained by analyzing the ruling. This paper presents an algorithm for ruling estimation of Glagolitic texts based on text line extraction. It is suitable for degraded manuscripts because it extrapolates the baselines using a priori knowledge of the ruling. The algorithm was tested on 30 pages of the Missale Sinaiticum, and the evaluation was based on visual criteria.

1 Introduction

Digitization permits lossless storage of the contents of a document and worldwide exchange of, or access to, documents. If entire libraries, or parts of the libraries of national archives or museums, are digitized¹, manual analysis or digital restoration of documents will not be feasible due to cost and time.

Algorithms developed for automated document analysis systems distinguish between documents containing handwritten text, documents containing printed text, and documents containing text and images. Methods for, e.g.,
the layout analysis of handwritten historical documents [1] and text line segmentation in handwritten documents [11, 10, 19] have been published. In contrast to these text line segmentation algorithms, the proposed approach attempts to fit the extracted line information into an a priori ruling scheme in order to gain information about the scribe (hand) of the manuscript and its spatiotemporal² origin, and to support layout analysis.

¹ http://www.michael-culture.org/, accessed March 2008

The algorithm proposed in this paper is used to analyze the ruling of the Missale Sinaiticum (Cod. Sin. slav. 5/N, 11th century) [13]. It contains handwritten Glagolitic [13] text and was found in 1975 at St. Catherine’s monastery on Mt. Sinai. The images of the Missale Sinaiticum were digitized in the project The Sinaitic Glagolitic Sacramentary (Euchologium) Fragments³. The folios are degraded due to exposure to water (for possible degradations of parchment, see [3]). Therefore, the text lines of the folios have to be segmented, and the extracted ruling must be extrapolated using a priori knowledge of the ruling scheme [13] (see Figure 3 (b)). The developed algorithm has a preprocessing stage, which comprises skew estimation, adaptive image binarization, and noise removal. After this preprocessing step, the text components (words, characters, etc.) are segmented and finally grouped to extract the text lines.

This paper is organized as follows: Section 2 reviews the state of the art of text line extraction methods. Section 3 describes the proposed algorithm in detail, Section 4 discusses the experimental results, and the conclusion follows.

2 Related Work

Methods for text line extraction include techniques based on projection profiles, the Hough transform [11, 9], perceptual grouping/clustering [8], Adaptive Local Connectivity Maps (ALCM) [19], and smearing [10].
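To make the first of these techniques concrete, the following is a minimal sketch of projection-profile line extraction. It is not the algorithm proposed in this paper; it only illustrates the general idea under the assumption that the page has already been deskewed and binarized (1 = ink, 0 = background). The function names `horizontal_projection` and `extract_line_bands`, the `min_ink` threshold, and the toy page are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple


def horizontal_projection(binary: List[List[int]]) -> List[int]:
    """Count foreground (ink) pixels in every row of a binary image."""
    return [sum(row) for row in binary]


def extract_line_bands(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row ranges whose ink count is >= min_ink.

    Each maximal run of such rows is treated as one candidate text line.
    """
    bands: List[Tuple[int, int]] = []
    start = None  # first row of the run currently being tracked, if any
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                      # a new run of ink rows begins
        elif count < min_ink and start is not None:
            bands.append((start, y - 1))   # the run ended on the previous row
            start = None
    if start is not None:                  # a run reaches the bottom edge
        bands.append((start, len(profile) - 1))
    return bands


# Toy 8x6 "page" with two text lines separated by blank rows (assumed data).
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

profile = horizontal_projection(page)
bands = extract_line_bands(profile)
print(profile)  # [0, 4, 4, 0, 0, 4, 4, 0]
print(bands)    # [(1, 2), (5, 6)]
```

A sketch like this depends on clean gaps between lines. On degraded or skewed pages, where parts of a line are missing or lines touch, the profile minima become unreliable, which is one reason to combine the extracted lines with prior knowledge of the ruling.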
² See, e.g., Leroy for the Greek tradition [7].
³ The authors would like to thank the Austrian Science Fund for funding the project under grant P19608-G12.

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

Likforman-Sulem et al. [10] present an overview paper