Learning to Rank Words: Optimizing Ranking Metrics for Word Spotting

Pau Riba [0000-0002-4710-0864], Adrià Molina [0000-0003-0167-8756], Lluis Gomez [0000-0003-1408-9803], Oriol Ramos-Terrades [0000-0002-3333-8812], and Josep Lladós [0000-0002-4533-4739]

Computer Vision Center and Computer Science Department, Universitat Autònoma de Barcelona, Catalunya
{priba,lgomez,oriolrt,josep}@cvc.uab.cat, adria.molinar@e-campus.uab.cat

Abstract. In this paper, we explore and evaluate the use of ranking-based objective functions for simultaneously learning a word string encoder and a word image encoder. We consider retrieval frameworks in which the user expects a retrieval list ranked according to a defined relevance score. In the context of word spotting, the relevance score is set according to the string edit distance from the query string. We experimentally demonstrate the competitive performance of the proposed model on query-by-string word spotting for both handwritten and real scene word images. We also provide results for query-by-example word spotting, although it is not the main focus of this work.

Keywords: Word Spotting · Smooth-nDCG · Smooth-AP · Ranking Loss.

1 Introduction

Word spotting, also known as keyword spotting, was introduced in the late 1990s in the seminal papers of Manmatha et al. [19,20]. It emerged quickly as a highly effective alternative to text recognition techniques in scenarios with scarce data availability or huge style variability, where a strategy based on full transcription is still far from feasible. Its objective is to obtain a ranked list of word images that are relevant to a user's query. Word spotting has typically been classified into two settings according to the target database gallery.
On the one hand, there are segmentation-based methods, where text images are segmented at the word image level [8,28]; on the other hand, there are segmentation-free methods, where words must be spotted from cropped text lines or full documents [2,26]. Moreover, according to the query modality, these methods can be classified as either query-by-example (QbE) [25] or query-by-string (QbS) [3,13,28,33], the latter being the more appealing from the user's perspective.

arXiv:2106.05144v1 [cs.CV] 9 Jun 2021

The current trend in word spotting methods is based on learning a mapping function from word images to a known word embedding space that can be