Pattern Recognition 110 (2021) 107656

Real-time Lexicon-free Scene Text Retrieval

Andrés Mafla, Rubèn Tito, Sounak Dey, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

Computer Vision Center, Universitat Autonoma de Barcelona, Edifici O, Campus UAB, Bellaterra (Cerdanyola), Barcelona 08193, Spain

Article history: Received 6 May 2019; Revised 19 August 2020; Accepted 9 September 2020; Available online 10 September 2020

Keywords: Image retrieval; Scene text detection; Scene text recognition; Word spotting; Convolutional neural networks; Region proposal networks; PHOC

Abstract: In this work, we address the task of scene text retrieval: given a text query, the system returns all images containing the queried text. The proposed model uses a single shot CNN architecture that predicts bounding boxes and builds a compact representation of spotted words. In this way, the problem can be modeled as a nearest neighbor search of the textual representation of a query over the outputs of the CNN collected from the totality of an image database. Our experiments demonstrate that the proposed model outperforms the previous state of the art, while offering a significant increase in processing speed and unmatched expressiveness with samples never seen at training time. Several experiments to assess the generalization capability of the model are conducted on a multilingual dataset, together with an application to real-time text spotting in videos.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

The development of language is one of the most influential inventions of humankind, allowing the communication of abstract and complex ideas. Similarly, written text permits this set of complex ideas to be captured, stored and communicated in an explicit manner.
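As a concrete illustration of the compact word representation named in the keywords above, the following is a minimal sketch of a PHOC (Pyramidal Histogram Of Characters) embedding. This is an assumption-laden simplification: it uses only unigram levels 1–5 over a 36-symbol alphabet, and the paper's actual descriptor configuration may differ.

```python
from typing import List, Sequence

# 36-symbol alphabet (a-z plus digits), a common choice for PHOC descriptors.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word: str, levels: Sequence[int] = (1, 2, 3, 4, 5)) -> List[int]:
    """Binary Pyramidal Histogram Of Characters for `word`.

    At pyramid level l the word is split into l horizontal regions; the bit
    for a character is set in a region when the character's normalized span
    overlaps that region by at least half of the character span's width.
    """
    word = word.lower()
    n = len(word)
    vec: List[int] = []
    for level in levels:
        for r in range(level):
            lo, hi = r / level, (r + 1) / level  # region boundaries in [0, 1]
            bits = [0] * len(ALPHABET)
            for i, ch in enumerate(word):
                idx = ALPHABET.find(ch)
                if idx < 0:
                    continue  # characters outside the alphabet are ignored
                c_lo, c_hi = i / n, (i + 1) / n  # character span in [0, 1]
                overlap = min(hi, c_hi) - max(lo, c_lo)
                if overlap / (c_hi - c_lo) >= 0.5:
                    bits[idx] = 1
            vec.extend(bits)
    return vec
```

For the five unigram levels above the descriptor has (1+2+3+4+5) × 36 = 540 dimensions; common PHOC variants additionally append frequent-bigram levels, omitted here for brevity.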
As shown by several authors [1,2], text is present in a large percentage of real-life imagery, especially in urban scenarios and documents. Given this, together with the ample availability of visual data, it becomes essential to develop algorithms that allow efficient information retrieval by exploiting the richness of the textual content found in images and video. Leveraging text in scene imagery provides significant boosts to tasks such as image retrieval, scene understanding, instant translation, human-computer interaction, robot navigation, assisted reading for the visually impaired and industrial automation.

(Corresponding author. E-mail addresses: amafla@cvc.uab.cat (A. Mafla), rperez@cvc.uab.cat (R. Tito), sdey@cvc.uab.cat (S. Dey), lgomez@cvc.uab.cat (L. Gómez), marcal@cvc.uab.cat (M. Rusiñol), ernest@cvc.uab.cat (E. Valveny), dimos@cvc.uab.cat (D. Karatzas).)

In recent years significant advances have been accomplished, particularly since the introduction of AlexNet [3], the architecture that won the ILSVRC2012 contest [4] by using deep learning techniques. Text spotting has been diverging from older approaches that used hand-crafted features towards current ones that employ automatic feature learning by exploiting deep learning methodologies [5,6]. Nonetheless, text spotting is not a trivial task and remains an open problem in the research community. Putting aside the complexity of spotting text in the wild, the importance of text lies in the high-level semantic and explicit information it carries, which cannot be obtained from visual cues alone. For example, there is a high degree of complexity involved in labelling images without considering the text found in them, even for humans. This effect is evident in Fig.
1, in which the storefronts alone could belong to a wide variety of businesses, but the exact label can be inferred if and only if the text contained in them is read and leveraged appropriately. Research conducted by Movshovitz et al. [7] showed that, while training a shop classifier, the proposed model ended up associating specific visual representations to textual information as the only way of differentiating between diverse businesses. The described effect is evident and explicitly addressed in later works [8–10], which focus on fine-grained classification of storefronts and bottles. Additional tasks that require integrating scene text and visual information into a common domain knowledge have been proposed, such as in [11,12], opening up new research paths.

Closely related to our work, Mishra et al. [13] proposed the task of scene text retrieval. The input to the system is a text query, which the system must employ to return all the images that contain the queried text. This task requires systems that are robust enough to perform fast word spotting while at the same time retaining the capacity to generalize to out-of-dictionary queries never seen before. An intuitive approach to tackle such a problem is to make use of state-of-the-art reading systems, and use their

https://doi.org/10.1016/j.patcog.2020.107656
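The retrieval setting described above — ranking database images by the similarity between a query's textual embedding and the word-box descriptors predicted for each image — can be sketched as a nearest-neighbor search. The cosine ranking and all names below are illustrative assumptions, not the paper's implementation:

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, with zero vectors scored as 0."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve(query_vec: Sequence[float],
             database: Dict[str, List[Sequence[float]]]) -> List[str]:
    """Rank image ids by the best similarity between the query descriptor
    and any word-box descriptor predicted for that image.

    `database` maps image id -> list of per-box descriptors collected
    offline from the detector (a hypothetical precomputed index).
    """
    scores = {img: max((cosine(query_vec, box) for box in boxes), default=0.0)
              for img, boxes in database.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because the per-image descriptors are computed once and stored, answering a query reduces to embedding the query string and running this ranking over the index, which is what makes the lexicon-free, real-time behavior described in the abstract plausible.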