Pattern Recognition 110 (2021) 107656
Real-time Lexicon-free Scene Text Retrieval
Andrés Mafla∗, Rubèn Tito, Sounak Dey, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas
Computer Vision Center, Universitat Autonoma de Barcelona. Edifici O, Campus UAB, Bellaterra (Cerdanyola) Barcelona 08193, Spain
Article info
Article history:
Received 6 May 2019
Revised 19 August 2020
Accepted 9 September 2020
Available online 10 September 2020
Keywords:
Image retrieval
Scene text detection
Scene text recognition
Word spotting
Convolutional neural networks
Region proposal networks
PHOC
Abstract

In this work, we address the task of scene text retrieval: given a text query, the system returns all images containing the queried text. The proposed model uses a single-shot CNN architecture that predicts bounding boxes and builds a compact representation of spotted words. In this way, the problem can be modeled as a nearest neighbor search of the textual representation of a query over the outputs of the CNN collected from the totality of an image database. Our experiments demonstrate that the proposed model outperforms the previous state of the art, while offering a significant increase in processing speed and unmatched expressiveness with samples never seen at training time. Several experiments to assess the generalization capability of the model are conducted on a multilingual dataset, as well as an application of real-time text spotting in videos.
© 2020 Elsevier Ltd. All rights reserved.
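The retrieval formulation described in the abstract can be sketched in a few lines: embed the query string into a PHOC-style binary vector (a Pyramidal Histogram Of Characters) and rank the database's word descriptors by cosine similarity. The alphabet, pyramid levels, and similarity measure below are illustrative assumptions for a minimal sketch, not the paper's exact configuration.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
LEVELS = (1, 2, 3)  # pyramid levels; assumed for illustration

def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Build a binary PHOC vector: at each pyramid level the word is
    split into equal regions, and each region gets a per-character
    presence indicator."""
    word = word.lower()
    vec = np.zeros(len(alphabet) * sum(levels), dtype=np.float32)
    n = len(word)
    offset = 0
    for level in levels:
        for i, ch in enumerate(word):
            idx = alphabet.find(ch)
            if idx < 0:
                continue  # skip characters outside the alphabet
            # character i occupies the interval [i/n, (i+1)/n) of the word
            start, end = i / n, (i + 1) / n
            for region in range(level):
                r0, r1 = region / level, (region + 1) / level
                # assign the character to a region if the overlap covers
                # at least half of the character's own extent
                overlap = min(end, r1) - max(start, r0)
                if overlap >= 0.5 * (end - start):
                    vec[offset + region * len(alphabet) + idx] = 1.0
        offset += level * len(alphabet)
    return vec

def retrieve(query, db_descriptors, db_images, k=5):
    """Rank database images by cosine similarity between the query PHOC
    and the per-word descriptors stored for each image."""
    q = phoc(query)
    q = q / (np.linalg.norm(q) + 1e-8)
    d = db_descriptors / (np.linalg.norm(db_descriptors, axis=1, keepdims=True) + 1e-8)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return [(db_images[i], float(scores[i])) for i in order]
```

In the actual system the database descriptors come from the CNN's predictions over detected word regions; here, exact PHOCs of known words stand in for them to illustrate the nearest-neighbor step.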
1. Introduction
The development of language is one of the most influential inventions of humankind, allowing the communication of abstract and complex ideas. Similarly, written text permits this set of complex ideas to be captured, stored and communicated in an explicit manner. As shown by several authors [1,2], text is present in a large percentage of real-life imagery, especially in urban scenarios and documents. Given the ample availability of visual data and the importance of text, it becomes essential to develop algorithms that allow efficient information retrieval by exploiting the richness of the textual content found in images and video. Leveraging text in scene imagery provides significant boosts to tasks such as image retrieval, scene understanding, instant translation, human-computer interaction, robot navigation, assisted reading for the visually impaired and industrial automation.
∗ Corresponding author.
E-mail addresses: amafla@cvc.uab.cat (A. Mafla), rperez@cvc.uab.cat (R. Tito), sdey@cvc.uab.cat (S. Dey), lgomez@cvc.uab.cat (L. Gómez), marcal@cvc.uab.cat (M. Rusiñol), ernest@cvc.uab.cat (E. Valveny), dimos@cvc.uab.cat (D. Karatzas).

In recent years, significant advances have been accomplished, particularly since the introduction of AlexNet [3], the architecture that won the ILSVRC2012 [4] contest using deep learning techniques. Text spotting has been moving away from older approaches based on hand-crafted features towards current ones that employ automatic feature learning through deep learning methodologies [5,6]. Nonetheless, text spotting is not a trivial task and remains an open problem in the research community.
Putting aside the complexity of spotting text in the wild, the importance of text lies in the high-level, explicit semantic information it carries, which cannot be obtained from visual cues alone. For example, labelling images without considering the text found in them involves a high degree of complexity, even for humans. This effect is evident in Fig. 1, in which the storefronts alone could belong to a wide plethora of businesses, but the exact label can be inferred only if the text they contain is read and leveraged appropriately. Research conducted by Movshovitz et al. [7] showed that, while training a shop classifier, the proposed model ended up associating specific visual representations with textual information as the only way of differentiating between diverse businesses. This effect is addressed explicitly in later works [8–10], which focus on fine-grained classification of storefronts and bottles. Additional tasks that require integrating scene text and visual information into a common domain knowledge have been proposed [11,12], opening up new research paths.
Closely related to our work, Mishra et al. [13] proposed the task of scene text retrieval: the input to the system is a text query, which the system must employ to return all the images that contain the queried text. This task requires systems that are robust enough to perform fast word spotting while at the same time retaining the capacity to generalize to out-of-dictionary queries never seen before. An intuitive approach to tackle such a problem is to make use of state-of-the-art reading systems, and use their
https://doi.org/10.1016/j.patcog.2020.107656