Segmentation-free Word Spotting in Historical Printed Documents

B. Gatos and I. Pratikakis
Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Research Center "Demokritos", 153 10 Athens, Greece
{bgat, ipratika}@iit.demokritos.gr

Abstract

In this paper, a new efficient word spotting methodology is presented that can be applied to historical printed documents without requiring any prior block or word segmentation step. Our aim is a segmentation-free methodology, since for many historical documents the segmentation process does not produce meaningful results due to unconstrained layout, various degradations or typesetting imperfections. The proposed method is based on block-based document image descriptors that are used in a template matching process that is invariant to translation, rotation and scaling. Computation time is reduced by applying the matching process only to salient regions of the image. Experimental results on a database of representative historical printed documents demonstrate the efficiency of the proposed approach.

1. Introduction

Effective historical document indexing and retrieval poses a great challenge due to the vast amount of information available in libraries all over the world in the form of printed or handwritten manuscripts. The challenge is amplified by the variability of documents, which stems from multilinguality and the wide range of historical periods over which the available collections were built, as well as by the poor quality of existing historical documents.

Word spotting is a content-based retrieval procedure that produces a ranked list of word images similar to a query word image. The query is either an actual example from the collection of interest or is artificially generated from an ASCII keyword. A crucial aspect of the retrieval procedure is the word image representation, which relies upon robust features.
The word spotting procedure is mostly used in an unsupervised manner. The absence of dependencies such as training, along with the ease of using several different feature variations, makes it a very appealing alternative to Optical Character Recognition (OCR), which is a difficult problem to solve, especially for historical documents.

In the literature, word spotting follows two distinct trends: the segmentation-based approach and the segmentation-free approach. In the former, considerable effort is devoted to solving the word segmentation problem [1-4]. In the latter, the query word image is matched against the corresponding word images in the document without any segmentation involved, mostly by treating the underlying problem as template matching. Representative work is reported in [5], which uses differential features that are compared using a cohesive elastic matching method, based on zones of interest in order to match only the informative parts of the words.

In the same spirit as the aforementioned approach, this paper presents a segmentation-free word spotting methodology that permits fast and effective retrieval based on block-based document image descriptors used in a template matching process that is invariant to translation, rotation and scaling.

The remainder of the paper is structured as follows. The proposed methodology is detailed in Section 2. In Section 3, the evaluation results on representative historical documents are presented, and in Section 4, conclusions are drawn.

2009 10th International Conference on Document Analysis and Recognition. 978-0-7695-3725-2/09 $25.00 © 2009 IEEE. DOI 10.1109/ICDAR.2009.236