Cross-Probe BERT for Fast Cross-Modal Search
Tan Yu
tanyu01@baidu.com
Cognitive Computing Lab
Baidu Research
Bellevue, WA, USA
Hongliang Fei
hongliangfei@baidu.com
Cognitive Computing Lab
Baidu Research
Bellevue, WA, USA
Ping Li
liping11@baidu.com
Cognitive Computing Lab
Baidu Research
Bellevue, WA, USA
ABSTRACT
Owing to the effectiveness of cross-modal attentions, text-vision
BERT models have achieved excellent performance in text-image
retrieval. Nevertheless, cross-modal attentions in text-vision BERT
models incur expensive computation cost when tackling text-
vision retrieval due to their pairwise input. Therefore, it is normally
impractical to deploy them for large-scale cross-modal
retrieval in real applications. To address the inefficiency issue in
existing text-vision BERT models, in this work, we develop a novel
architecture, cross-probe BERT. It devises a small number of text
and vision probes, and cross-modal attention is efficiently
achieved through the interactions between text and vision probes.
It takes lightweight computation cost, and meanwhile effectively
exploits cross-modal attention. Systematic experiments on public
benchmarks demonstrate the excellent effectiveness and efficiency of
our cross-probe BERT.
CCS CONCEPTS
· Computing methodologies → Artificial intelligence; · Information systems → Image search.
KEYWORDS
Cross-modal Retrieval; Multimedia Search; Cross-modal BERT;
ACM Reference Format:
Tan Yu, Hongliang Fei, and Ping Li. 2022. Cross-Probe BERT for Fast Cross-
Modal Search. In The 45th International ACM SIGIR Conference on Research
and Development in Information Retrieval, July 11-15, 2022, Madrid, Spain.
ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3477495.3531826
1 INTRODUCTION
Inspired by the great success achieved by the self-attention mechanism of
Transformer [32] and BERT [4] in NLP tasks, several text-vision
BERT models [14, 17, 20, 37] have emerged. They take the query-image
pair as input and extend the original text-modal self-attention to
multi-modal self-attention. Text-vision BERT effectively
models the interactions between image features and query features,
provides contextual encoding for both image features and query
features, and achieves excellent cross-modal retrieval accuracy.
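The multi-modal self-attention described above can be sketched as a single attention step over the concatenated text and image token sequences. This is a minimal illustration with random stand-in features and no learned projections; a real text-vision BERT stacks many such layers with learned query/key/value weights and multiple heads:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
n_words, n_regions = 12, 36      # query word tokens and image region tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

words = rng.standard_normal((n_words, d))      # query word features
regions = rng.standard_normal((n_regions, d))  # image region features

# Multi-modal self-attention: text and vision tokens form one sequence,
# so every word can attend to every image region and vice versa.
tokens = np.concatenate([words, regions], axis=0)   # (48, d)
attn = softmax(tokens @ tokens.T / np.sqrt(d))      # (48, 48) attention map
contextual = attn @ tokens                          # contextual encodings

print(contextual.shape)   # (48, 64)
```

The key point for the efficiency discussion that follows is that `tokens` mixes both modalities, so none of this computation can be cached per image independently of the query.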
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
SIGIR, July 11-15, 2022, Madrid, Spain
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-8732-3/22/07. . . $15.00
https://doi.org/10.1145/3477495.3531826
In spite of the high effectiveness achieved by text-vision BERT,
the extremely high computation cost brought by its pairwise input
limits its practical usefulness, especially for large-scale cross-modal
retrieval in industrial applications. Given a query and N reference
images, it needs to feed N query-image pairs to text-vision BERT
for relevance scores. That is, it requires repeatedly encoding the
query N times. In a large-scale cross-modal retrieval task, N is
extremely large, making text-vision BERT prohibitively slow for
obtaining relevance scores with all reference images. In contrast, a
two-tower encoder only needs to encode the query one time,
and reference image features can be pre-computed offline and
cached in the database. Thus, it obtains relevance scores between
the query and images in an efficient manner by computing similar-
ities between the query feature and the pre-computed image features.
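The two-tower efficiency argument above can be sketched in a few lines. Here random vectors stand in for the outputs of learned encoders (hypothetical placeholders, not the paper's networks); only the cached-feature dot-product step mirrors the real online pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                 # shared embedding dimension
n_images = 10_000       # size of the reference image collection

# Offline: encode every reference image once and cache the features.
# (A random matrix stands in for a learned image encoder.)
image_feats = rng.standard_normal((n_images, d))
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)

# Online: encode the query once, then score all N images with one matmul.
query_feat = rng.standard_normal(d)
query_feat /= np.linalg.norm(query_feat)

scores = image_feats @ query_feat      # cosine similarities to all images
top10 = np.argsort(-scores)[:10]       # highest-scoring images
print(top10.shape)                     # (10,)
```

The contrast with a text-vision BERT is that the latter would need one joint forward pass per (query, image) pair, i.e., N encoder calls online instead of the single query encoding here.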
Though the inefficient pairwise attention limits the usefulness of
text-vision BERT in large-scale cross-modal retrieval, there are few
works that speed up text-vision BERT. In fact, the inefficiency caused
by pairwise input is a general problem which is also encountered in
other retrieval tasks such as query-to-document retrieval [10] and ques-
tion answering [2, 23]. In these tasks, there are similarly two main-
stream encoders for obtaining the relevance score. The first type
of encoder, the Bi-encoder [5], is based on the two-tower architecture.
Since the query/question and document are independently encoded,
document features can be pre-computed and cached. In this case,
the relevance between the query and each document can be deter-
mined by the cosine similarity between the query/question's feature
and the document's cached feature. It achieves high efficiency but
relatively low retrieval accuracy. In contrast, the Cross-encoder [31]
takes a question-answer pair as input, exploiting cross-attention
like text-vision BERT, and achieves high retrieval accuracy but is
inefficient. To balance effectiveness and efficiency, existing meth-
ods [2, 23] adopt the two-tower architecture in the lower layers and
use the cross-attention architecture in the upper layers. We term
this architecture the "split-merge" encoder. In this case, features
from the lower two-tower layers are pre-computed and cached. Then
question-answer attentions are conducted in the upper layers. Since
the number of upper cross-attention layers is small, efficiency is
boosted. Similarly, Poly-Encoder [10] adopts the two-tower archi-
tecture for feature extraction, and uses an additional cross-attention
layer on top to obtain the similarity.
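The "split-merge" idea, one lightweight attention step applied online over per-token features that the lower towers produced offline, can be sketched as follows. Dimensions, the single-head attention form, and the random stand-in features are illustrative assumptions, not the exact designs of [2, 10, 23]:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
n_doc_tokens = 32

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Offline ("split" stage): lower two-tower layers produce per-token
# document features, which are cached in the index.
doc_tokens = rng.standard_normal((n_doc_tokens, d))

# Online: encode the query once with its own tower.
query_feat = rng.standard_normal(d)

# Upper "merge" stage: one cross-attention step in which the query
# attends over the cached document tokens, followed by a dot product.
attn = softmax(doc_tokens @ query_feat / np.sqrt(d))   # (n_doc_tokens,)
doc_summary = attn @ doc_tokens                        # query-conditioned pooling
score = float(doc_summary @ query_feat)
print(attn.shape)   # (32,)
```

Because only this thin merge stage runs per (query, document) pair, the per-pair cost is a single attention step rather than a full stack of joint transformer layers.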
In this paper, we propose a novel architecture, cross-probe BERT
(CP-BERT), for effective and efficient cross-modal retrieval. We
devise several vision probes and text probes along with the image's
local features and the query's word features. In the lower few
layers, we adopt the two-tower architecture. The vision probes and
the image's local features are concatenated and fed into the vision
tower to generate the attended vision probes. In parallel, the text
probes and the query's word features are concatenated and fed into