A. Petrosino (Ed.): ICIAP 2013, Part I, LNCS 8156, pp. 61–70, 2013.
© Springer-Verlag Berlin Heidelberg 2013
Layout-Based Document-Retrieval System by Radon
Transform Using Dynamic Time Warping
Giuseppe Pirlo¹,*, Michela Chimienti², Michele Dassisti³,
Donato Impedovo⁴, and Angelo Galiano⁴
¹ Dipartimento di Informatica, Università degli Studi di Bari "A. Moro",
via Orabona 4, 70125 Bari, Italy
² Laboratorio Kad3, C.da Baione, 70043 Monopoli (BA), Italy
³ Dip. Meccanica, Management e Matematica, Politecnico di Bari,
viale Japigia 182, 70126 Bari, Italy
⁴ Dyrecta Lab, Via V. Simplicio 45, 70014 Conversano (BA), Italy
giuseppe.pirlo@uniba.it
Abstract. In the context of the sustainability of document management
technologies, this paper presents a new system for layout-based document
retrieval, specifically designed for commercial forms. The system first
uses a technique based on mathematical morphology to extract grid-based
structural components from the document image. The Radon transform is
then used to describe the document layout. Finally, documents are matched
using a technique based on dynamic time warping. Experimental results on
real and simulated data sets demonstrate the effectiveness of the approach
on different classes of commercial forms.
Keywords: Document Management, Document Image Retrieval, Sustainability,
Mathematical Morphology, Radon Transform, Dynamic Time Warping.
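As a rough illustration of the description and matching steps named in the abstract, the following Python sketch compares two small binary images. It is not the authors' implementation: it assumes the layout is described only by row and column sums (the discrete Radon projections at 0 and 90 degrees), uses the textbook dynamic time warping recurrence with an absolute-difference local cost, and omits the morphological preprocessing. All function names are hypothetical.

```python
from math import inf


def projection_profiles(image):
    """Return (row sums, column sums) of a binary image given as a list of rows.

    Assumption: only two projection angles are used, as a stand-in for a
    full Radon transform over many angles.
    """
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile


def dtw_distance(a, b):
    """Textbook DTW cost between two 1-D sequences.

    Local cost is |a[i] - b[j]|; allowed steps are insertion, deletion
    and match, with no window constraint or normalization.
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # deletion
                                 cost[i][j - 1],      # insertion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]


def layout_distance(img_a, img_b):
    """Sum of the DTW costs of the row profiles and of the column profiles."""
    rows_a, cols_a = projection_profiles(img_a)
    rows_b, cols_b = projection_profiles(img_b)
    return dtw_distance(rows_a, rows_b) + dtw_distance(cols_a, cols_b)


# Toy "form": a single rectangular cell drawn as a frame of ones.
form = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
```

Because dynamic time warping may align one sample with several consecutive samples of the other sequence, profiles that differ only by a local stretch can still match at low cost. This is one reason the technique suits forms whose cells vary slightly in size.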
1 Introduction
Information Retrieval (IR) is a critical task of document management systems as the
number of documents available in databases and digital libraries grows exponentially.
Quite often, a document that is lost or unavailable must be reprinted, a waste that
could otherwise be avoided. This is partly because standard document retrieval systems
operate on text data: they require a document to be available in text form, and queries
are based on specific textual content of the document. Several advanced techniques
have been proposed, based on set-theoretic, algebraic and probabilistic models [1, 2,
3]. Whatever the model used, one of the main drawbacks of text-based document
retrieval systems is that they require a document in text form, since the search for
similar documents is based on comparing textual contents. As a consequence, a
preliminary stage of image-to-text conversion by an Optical Character Recognizer
(OCR) is required when a document is in image form. OCR is a time-consuming
* Corresponding author.