Zoning Aggregated Hypercolumns for Keyword Spotting

Giorgos Sfikas, George Retsinas, Basilis Gatos
Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research Demokritos, GR-15310 Agia Paraskevi, Athens, Greece
{sfikas, georgeretsi, bgat}@iit.demokritos.gr

Abstract—In this paper we present a novel descriptor and method for segmentation-based keyword spotting. We introduce Zoning-Aggregated Hypercolumn features as pixel-level cues for document images. Motivated by recent research in machine vision, we use an appropriately pretrained convolutional network as a feature extraction tool. The resulting local cues are subsequently aggregated to form word-level fixed-length descriptors. Encoding is computationally inexpensive and does not require learning a separate feature generative model, in contrast to other widely used encoding methods such as Fisher Vectors. Keyword spotting trials on machine-printed and handwritten documents show that the proposed model gives very competitive results.

I. INTRODUCTION

In cases where full recognition is not necessary or high-quality recognition is not feasible, keyword spotting techniques allow the end-user to search a document for instances of a specific word [1], [2]. Keyword spotting and recognition, especially in handwritten documents, remain significant challenges compared to other forms of text, despite important recent advances [2]. The same is true for other document understanding tasks such as layout analysis or text segmentation. Writing style variance and cursiveness within the documents of a single author, as well as variance between the styles of different authors, are important problems in processing handwritten text that are not encountered in machine-printed documents.
Older handwritten manuscripts present additional difficulties due to degradations in the quality of the digitized document, making all document understanding operations even harder. Segmentation of a document into text components is used in many document understanding systems as a basic pre-processing step; text lines and text words are the components routinely targeted by segmentation algorithms. Word spotting can then be formulated as an image retrieval problem in which the query is a word image. Another line of techniques assumes that the user supplies a word string as the query. The two approaches are known in the literature as Query-by-Example (QbE) and Query-by-String (QbS), respectively [2].

In this work, a novel QbE, segmentation-based keyword spotting method is proposed. We use a deep Convolutional Neural Network (CNN) pretrained for a character classification task [3]. Instead of using the CNN for the task it was originally trained for, i.e. character classification, we use it as an off-the-shelf feature extractor; it has been shown that per-layer activations can act as efficient local descriptors [4], [5]. The pool of resulting convolutional features is aggregated into a single descriptor per word image. Aggregation is performed by simple sum-pooling, which has recently been demonstrated to be an appropriate encoding technique for convolutional features [6]. We combine this simple aggregation model with a zoning scheme, suitable for word images, to create the fixed-length word-level feature vector. We shall refer to this word-level descriptor as the "Zoning-Aggregated Hypercolumns" (ZAH) descriptor. Querying with the proposed descriptor is performed by nearest-neighbour search in the (Euclidean) descriptor space. Numerical results show that our approach leads to competitive keyword spotting (KWS) results.

The outline of the rest of this paper is as follows. In Section II we discuss related work in the literature.
In Section III we present the proposed method in detail. In Section IV we present numerical results comparing our method to other segmentation-based word spotting methods. In Section V we present final conclusions and thoughts on future work.

II. RELATED WORK

Keyword spotting can be seen as a special form of an image retrieval problem. As in image retrieval, suitable descriptors have to be created for the query and for each word image in the document to be searched. As in all image understanding tasks, features matter, and powerful features have been used to build good descriptors for word spotting and recognition tasks. Such features range from low-level column-based profiles to more elaborate shape-based or patch-based features [1], [7], [2]. Given a dense set of feature vectors per image, matching is performed either by dynamic programming [7], by direct comparison using a suitable metric (often the Euclidean) [1], or by first applying an encoding technique [8] before comparing the encoded vectors. Various state-of-the-art models based on the encoding of patch-based SIFT or HOG features have been proposed [2], [9], [8].
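To make the retrieval pipeline concrete, the following is a minimal NumPy sketch of zone-wise sum-pooling followed by Euclidean nearest-neighbour querying, in the spirit of the ZAH descriptor described in the introduction. It assumes per-pixel hypercolumn features have already been extracted (random arrays stand in for CNN activations here); the four vertical zones, the 16-dimensional features, and the L2 normalisation are illustrative assumptions, not the exact configuration of the proposed method.

```python
import numpy as np

def zah_descriptor(features, n_zones=4):
    """Sum-pool per-pixel hypercolumn features inside vertical zones and
    concatenate the zone sums into one fixed-length word descriptor.

    features: (H, W, D) array, one D-dimensional hypercolumn per pixel.
    Returns an L2-normalised vector of length n_zones * D.
    """
    H, W, D = features.shape
    # Zone boundaries along the word's width (the zoning layout is assumed).
    bounds = np.linspace(0, W, n_zones + 1).astype(int)
    zone_sums = [features[:, bounds[i]:bounds[i + 1], :].sum(axis=(0, 1))
                 for i in range(n_zones)]
    desc = np.concatenate(zone_sums)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

def rank_by_euclidean(query_desc, db_descs):
    """Return database indices sorted by Euclidean distance to the query."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return np.argsort(dists)

# Toy usage: random activations stand in for real CNN hypercolumns.
rng = np.random.default_rng(0)
words = [rng.random((32, 60 + 10 * i, 16)) for i in range(5)]  # variable widths
db = np.stack([zah_descriptor(w) for w in words])              # (5, 64), fixed length
query = zah_descriptor(words[2] + 0.001 * rng.random(words[2].shape))
ranking = rank_by_euclidean(query, db)
print(ranking[0])  # 2: a near-duplicate of word 2 is retrieved first
```

Whatever the exact zoning layout, the key property is that variable-width word images all map to descriptors of the same fixed length, so QbE retrieval reduces to a plain nearest-neighbour search in Euclidean space.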