Enhancing Word Image Retrieval in Presence of Font Variations

Viresh Ranjan 1, Gaurav Harit 2, C. V. Jawahar 1
1 CVIT, IIIT Hyderabad, India
2 IIT Jodhpur, India

Abstract—This paper investigates the problem of cross-document image retrieval, i.e. the use of query images from one style (say, a font) to perform retrieval from a collection which is in a different style (say, a different set of books). We present two approaches to tackle this problem. We propose an effective style-independent retrieval scheme using a nonlinear style-content separation model. We also propose a semi-supervised style transfer strategy to expand the query into multiple styles. We validate both approaches on a collection of word images which vary in fonts/styles.

I. INTRODUCTION

Font and style variations make recognition and retrieval challenging when working with large and diverse document image databases. Commonly, a classifier is trained on a certain set of fonts available a priori, and one hopes that it generalizes across fonts thanks to either the quality of the features or the power of the classifier. In practice, however, these solutions give degraded performance when used on target documents with a new font. If the entire target dataset is available at training time, then it is possible to learn a classifier [1] which could work across several fonts. If the fonts in the database are known, one could render the textual queries in each of these fonts and retrieve from the database [1]. In some cases, style clustering [2], [3] is performed first, and separate classifiers are then learnt for each style cluster. In this work, we are interested in an effective retrieval solution where the query is a word image and the database has an unknown set of fonts. We formulate the retrieval problem in a nearest neighbor setting.
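The nearest-neighbor formulation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each word image has already been reduced to a fixed-length feature vector, and it ranks the database by Euclidean distance. The function name and shapes are ours.

```python
import numpy as np

def nearest_neighbor_retrieval(query, database, k=5):
    """Rank database word images by Euclidean distance to the query.

    query:    (d,) feature vector of the query word image (hypothetical layout)
    database: (n, d) matrix, one feature vector per database word image
    Returns the indices of the k closest database entries, nearest first.
    """
    dists = np.linalg.norm(database - query, axis=1)  # distance to every entry
    return np.argsort(dists)[:k]                       # k smallest distances
```

Retrieval quality then rests entirely on the distance being smaller for same-word pairs than for different-word pairs, which is exactly the assumption that font variation undermines.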
In this setting, the distance for finding nearest neighbors can be Euclidean [4], or the cost of aligning two feature vector sequences with Dynamic Time Warping (DTW) [5]. If the query is a word image, then we need to transfer or expand the query into multiple fonts. Query expansion, a technique for reformulating a seed query, is common practice in information retrieval: the seed query is reformulated by also taking into account semantically and morphologically related words. A natural extension of query expansion to cross-document word image retrieval is to automatically reformulate the query word in multiple fonts. In this paper, we propose a query reformulation strategy which builds on this very idea.

To motivate the challenges in cross-document retrieval, we conduct an experiment on words rendered in two different fonts. We argue that the distance between two feature vector representations can become ineffective in the presence of font variations. In Figure 1, we present the Euclidean distance between profile feature representations of different words in the same font, as well as of the same word in different fonts. Smaller inter-class distances and larger intra-class distances lead to many false positives and poorer retrieval. This shows that font variation can be a crucial factor when performing cross-document word image retrieval (see more in Sec. II).

Fig. 1. Euclidean distance between profile feature representations of pairs of word images. The Euclidean distance can be affected more by font variation than by a difference in the underlying word labels; for example, the distance between "battle" in the two fonts is more than that between "battle" and "cattle" in the same font.

Many efficient approaches for word image retrieval have been proposed in the recent past. Rath and Manmatha [5], as well as Meshesha and Jawahar [6], use a profile based representation along with DTW based retrieval.
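The DTW alignment cost mentioned above can be sketched with the standard dynamic-programming recurrence. This is a generic textbook sketch, not the authors' code: it assumes each word image is represented as a sequence of per-column feature vectors (e.g. profile features), so two images of different widths yield sequences of different lengths.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW alignment cost between two feature-vector sequences.

    seq_a: (m, d) array, seq_b: (n, d) array — e.g. per-column profile
    features of two word images with different widths m and n.
    """
    m, n = len(seq_a), len(seq_b)
    D = np.full((m + 1, n + 1), np.inf)  # accumulated-cost table
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```

Unlike the Euclidean distance, DTW tolerates length differences by warping the column axis, which is why it is favored for word images despite its higher cost per comparison.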
In many of the recent works, either DTW or Euclidean distance is used; Euclidean distance is often preferred for scalability in retrieval [7]. These approaches primarily depend upon training data to handle font variations and may not generalize well to previously unseen fonts.

If the target style is not known a priori but certain samples (labeled or unlabeled) of the target dataset are available, then it is possible to transfer (adapt) the classifiers learned on the training data so that they can handle the new style of the target dataset. This technique is known as transfer learning [8], and it has been widely used in applications such as handwriting recognition [2], [9] and face pose classification [10]. Transfer learning may involve (i) feature transformations, e.g. updating the regression matrix [11] or the LDA transformation matrix [12], or (ii) classifier adaptation, e.g. retraining strategies for neural networks [13], SVMs [14], etc. The adaptation process needs to be unsupervised if labeled data from the target dataset is not available. The classifier would then need to use some suitable self-learning strategy [15], [16] to learn the style context in a group of patterns.

The objective of this work is to perform word image retrieval from a collection of books/documents, where the query word image could be in a different style from those