Document Analysis Applied to Fragments: Feature Set for the Reconstruction of Torn Documents

Markus Diem *
Institute of Computer Aided Automation
Favoritenstr. 9/1832
1040 Vienna
diem@prip.tuwien.ac.at

Florian Kleber *
Institute of Computer Aided Automation
Favoritenstr. 9/1832
1040 Vienna
kleber@prip.tuwien.ac.at

Robert Sablatnig
Institute of Computer Aided Automation
Favoritenstr. 9/1832
1040 Vienna
sab@prip.tuwien.ac.at

ABSTRACT
Document analysis is usually applied to entire forms (e.g. intelligent form analysis, table detection) or used to describe the layout/structure of a document. In this paper document analysis is applied to snippets of torn documents to calculate features that can be used for reconstruction. The main aim is to handle snippets of varying size and different contents (e.g. handwritten or printed text). Documents are either destroyed intentionally to make the printed content unavailable (e.g. business crime) or degrade over time, as is the case for ancient documents (e.g. bad storage conditions). Current reconstruction methods for manually torn documents rely on the shape of the fragments or on e.g. inpainting and texture synthesis techniques. This paper shows the potential of document analysis techniques applied to snippets to support a reconstruction algorithm with additional features. These features comprise a rotational analysis, a color analysis, a line detection, a paper type analysis (checked, lined, blank) and a classification of the text (printed or handwritten). Preliminary results show that these features can be determined reliably on a real dataset consisting of 690 snippets.

Categories and Subject Descriptors
I.7 [Computing Methodologies]: Document and Text Processing; I.7.5 [Document and Text Processing]: Document Capture—Document Analysis; I.4 [Computing Methodologies]: Image Processing and Computer Vision

Keywords
Document reconstruction, skew, layout analysis

1.
INTRODUCTION
One way to make information (writings, drawings) inscribed on writing materials (paper, parchment, papyrus) unreadable is to fragment the writing material. Although parts of the information still exist on single fragments, the entire text and therefore the context of the document is destroyed. Reasons for intentionally tearing writing materials are either criminal intentions (business crime, tax fraud investigation, secret service documents [6]) or e.g. the protection of sensitive data/personal information (bank details, credit card numbers). Unintended fragmentation of documents concerns either ancient manuscripts that are fragmented due to environmental effects (influence of mold, water) or catastrophes like the collapse of the historical archive of the City of Cologne (a total of 18 shelf kilometers of books was destroyed) [11]. Reconstructing fragmented writing materials allows the lost content to be retrieved and analyzed. This is done for objects of cultural and historic value or, as already mentioned, for crime investigation.

* Corresponding Authors

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAS ’10, June 9-11, 2010, Boston, MA, USA
Copyright 2010 ACM 978-1-60558-773-8/10/06 ...$10.00

In this paper only fragments of “manually” torn paper containing German or English text are considered. This means that snippets have an irregular shape and that overlapping or missing parts are possible. In contrast, mechanical document shredders, which produce either strips or parallelograms (cross-cut shredders), are not treated.
The reconstruction of shredded paper is discussed in e.g. Ukovich et al. and De Smet et al. [44, 39]. The Fraunhofer Institute for Production Systems and Design Technology Berlin has also developed a system for the reconstruction of shredded paper, which has already been used by the German police and in tax fraud investigation [36]. It is also assumed that all snippets are scanned at the same resolution against a defined background, which allows a global threshold to be applied to obtain a mask image.

Reassembling algorithms use either the shape of the fragments [31, 12], the content of the fragments (e.g. Nielsen et al. [30]) or a combination of shape and content as a feature (e.g. Yao et al. [46]). As content features, either color analysis [10] or e.g. inpainting and texture synthesis techniques [34] are applied to each piece. If only the border regions are taken into account, the main information printed on the snippet (which can be used for reconstruction) is lost. Problems specific to manually torn paper are overlapping fragments (if shearing effects appear on the torn border) and gaps that occur if pieces of the border are broken off or lost.

In this paper document analysis techniques are applied to calculate the following features: the rotation of each snippet