A Document Image Retrieval System

Konstantinos Zagoris a, Kavallieratou Ergina b, Nikos Papamarkos a,*

a Image Processing and Multimedia Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
b Department of Information and Communication Systems Engineering, University of the Aegean, Samos 83100, Greece

Article history: Received 9 February 2008; received in revised form 9 March 2010; accepted 10 March 2010; available online 13 April 2010

Keywords: Document retrieval; Word spotting; Segmentation; Information retrieval; Feature extraction

Abstract

In this paper, a system is presented that locates words in document image archives. The technique performs word matching directly on the document images, bypassing character recognition and using word images as queries. First, it makes use of document image processing techniques in order to extract powerful features for the description of the word images. The features used for the comparison capture the general shape of the query and ignore details due to noise or different fonts. In order to demonstrate the effectiveness of our system, we used a collection of noisy documents and compared our results with those of a commercial optical character recognition (OCR) package.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, the world has experienced a phenomenal growth in the volume of multimedia data, and especially of document images, whose number has increased thanks to the ease of creating such images with scanners or digital cameras. Thus, huge quantities of document images are created and stored in image archives without any indexing information. In order to exploit these collections satisfactorily, it is necessary to develop effective techniques for retrieving the document images.
A detailed survey of document image retrieval up to 1997 can be found in Doermann (1998). Historically, the first approach to the problem was an index of descriptors for each document, provided manually by experts (Salton, 1989). Next, with improvements in the character recognition field, optical character recognition (OCR) packages were applied to documents in order to convert them to text. These techniques transformed the characters contained in the image into machine-editable text. Thus, Edwards (2004) described an approach to transcribing and retrieving Medieval Latin manuscripts with generalized Hidden Markov Models, whose hidden states correspond to characters and the spaces between them. A training instance is used per character, together with character n-grams, yielding a transcription accuracy of 75%. Tan et al. (2002) described an approach to retrieving machine-printed documents with a textual query, not necessarily in ASCII notation. They describe both the query and the words occurring in the document images with features, which may then be matched in order to identify occurrences of the query term. A disadvantage of the above approaches is their low noise tolerance, which yields low retrieval scores (Ishitani, 2001).

More recently, with improvements in the document image processing (DIP) field, techniques that make use of images instead of OCR have also been introduced. Leydier et al. (2005) used DIP techniques to create a pattern dictionary of each document and then performed word spotting using the gradient angle as the feature, together with a matching algorithm. Kolcz et al. (2000) described an approach for retrieving handwritten documents using word image templates. Their word image comparison algorithm is based on matching the provided templates to segmented manuscript lines from the Archive of the Indies collection. Konidaris et al.
(2007) propose a technique for keyword-guided word spotting in historical printed documents. They create synthetic word images as queries and perform word segmentation using dynamic parameters and hybrid feature extraction. Finally, they use user feedback to optimize the retrieval. Matching of entire words in printed documents is also performed by Balasubramanian et al. (2006). In this approach, a partial matching scheme based on dynamic time warping (DTW) is used to overcome the morphological differences between the words. A similar technique is used in the case of historical documents (Rath and Manmatha, 2003), where noisy handwritten document images are preprocessed into one-dimensional feature sets and compared using the DTW algorithm. Rath et al. (2004) presented a method for retrieving large collections of handwritten historical documents using statistical models. Using a word image matching algorithm, they clustered occurrences of the

* Corresponding author. Tel.: +30 25410 79585. E-mail address: papamark@ee.duth.gr (N. Papamarkos).
Engineering Applications of Artificial Intelligence 23 (2010) 872–879. Journal homepage: www.elsevier.com/locate/engappai
0952-1976 © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.engappai.2010.03.002