A Fast Keyword-Spotting Technique

Linlin Li, Shijian Lu and Chew Lim Tan
Department of Computer Science, National University of Singapore
Kent Ridge, Singapore 117543
{lilinlin,lusj,tancl}@comp.nus.edu.sg

Abstract

To capture the content of an imaged document while avoiding time-consuming full-scale OCR, which handles touching characters poorly, this paper proposes a fast, segmentation-free keyword spotting method. The method is based on a word shape coding technique. The proposed coding scheme has little ambiguity and can be executed swiftly, making it a promising technique for improving document image retrieval. The strength of the proposed method is demonstrated in a document filtering experiment. The experimental results show that document filtering based on the proposed method is more than 20 times faster than filtering based on OCR, with comparable filtering accuracy.

1. Introduction

To date, many efforts have been made to build digital libraries that digitize high-volume archives of paper documents (patents, legal tomes, historical documents) to provide the public with free and easy online access. These digital libraries store scanned images, which preserve visual information such as layout and decorations. However, this leads to difficulties in document retrieval, because traditional text information retrieval techniques fail completely when documents are presented simply as raw bitmaps. A feasible solution is OCR, but current OCR software cannot always provide accurate and reliable transcriptions of document images. Because it relies on character segmentation, OCR performance degrades dramatically when touching adjacent characters appear frequently in an image. Furthermore, OCR requires a very long execution time.
Therefore, for information retrieval applications such as document filtering, which rely on locating a few important keywords in a document, resorting to full-scale OCR is wasteful. More importantly, for databases with large volumes of document images, the time cost makes it impractical to convert all images into text by OCR.

A technique known as word shape analysis has been proposed as an alternative to full-scale OCR. The technique is expected to be faster and more reliable when document images are of poor quality. Comprehensive surveys of word shape analysis techniques can be found in [4] [6]. To date, word shape analysis techniques can be roughly divided into two groups, both of which represent a word image as a whole unit instead of recognizing each character. The first group [3] is based on analyzing pixel-level features of the whole word image, such as intensity autocorrelation and moments. In these approaches, each word image in a document is represented by feature vectors. These approaches are language independent and very robust to poor image quality, but they require appropriate training sets. In addition, the feature vectors are difficult to index. The second group is based on word shape coding [7] [8] [5]. Word shape coding encodes a word image into a sequence of predefined symbols. The symbol set is often smaller than the character set, and its symbols are easier to recognize. Each word has a unique corresponding symbol string, but one symbol string may map to several real words because of the reduced symbol set; this is referred to as ambiguity. Within a given language, the limited number of valid character arrangements may help avoid ambiguity or reduce it to an acceptable level, as discussed later in this paper. Compared with the first group, these approaches have the advantages that queries are easy to form, the codes are easy to index, and no training is required.
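The ambiguity of a reduced symbol set can be illustrated with a minimal sketch. It assumes a hypothetical three-symbol set (ascender, descender, x-height) and operates on text characters rather than on word images; it is not the coding scheme proposed in this paper, and the names shape_code and build_index are illustrative only.

```python
from collections import defaultdict

# Hypothetical reduced symbol set (an assumption for illustration only):
#   'A' = ascender or capital letter, 'D' = descender, 'x' = x-height letter.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word: str) -> str:
    """Map each character of a word to one of three shape symbols."""
    symbols = []
    for ch in word:
        if ch.isupper() or ch in ASCENDERS:
            symbols.append("A")
        elif ch in DESCENDERS:
            symbols.append("D")
        else:
            symbols.append("x")
    return "".join(symbols)


def build_index(vocabulary):
    """Group words by shape code; a code with several words is ambiguous."""
    index = defaultdict(list)
    for word in vocabulary:
        index[shape_code(word)].append(word)
    return dict(index)


index = build_index(["cat", "eat", "dog", "hog", "shape"])

# Each word has exactly one code, but one code may cover several words.
for code, words in index.items():
    status = "ambiguous" if len(words) > 1 else "unique"
    print(code, words, status)
```

In this toy vocabulary, "cat" and "eat" share one code and "dog" and "hog" share another, while "shape" has a code of its own. A richer symbol set would separate more words at the cost of a harder recognition task.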
On the other hand, their drawbacks are that they are language dependent and not as robust as the first group.

We propose in this paper a fast and segmentation-free keyword spotting technique. Keyword spotting locates the occurrences of a given keyword in an image. The proposed method has two components: word shape coding and similarity estimation. It is directly applicable to retrieval applications such as Boolean retrieval and document filtering, and it is a promising basis for better document image retrieval with more sophisticated IR models.

This paper is organized as follows. Sections 2 and 3 introduce the related work and the proposed keyword spotting method in detail. Section 4 introduces