Cross-Probe BERT for Fast Cross-Modal Search

Tan Yu (tanyu01@baidu.com), Hongliang Fei (hongliangfei@baidu.com), Ping Li (liping11@baidu.com)
Cognitive Computing Lab, Baidu Research, Bellevue, WA, USA

ABSTRACT
Owing to the effectiveness of cross-modal attention, text-vision BERT models have achieved excellent performance in text-image retrieval. Nevertheless, cross-modal attention in text-vision BERT models incurs an expensive computation cost in text-vision retrieval due to the pairwise input. It is therefore normally impractical to deploy these models for large-scale cross-modal retrieval in real applications. To address the inefficiency of existing text-vision BERT models, in this work we develop a novel architecture, cross-probe BERT. It devises a small number of text and vision probes, and cross-modal attention is achieved efficiently through the interactions between the text and vision probes. It takes a lightweight computation cost while effectively exploiting cross-modal attention. Systematic experiments on public benchmarks demonstrate the excellent effectiveness and efficiency of our cross-probe BERT.

CCS CONCEPTS
· Computing methodologies → Artificial intelligence; · Information systems → Image search.

KEYWORDS
Cross-modal Retrieval; Multimedia Search; Cross-modal BERT

ACM Reference Format:
Tan Yu, Hongliang Fei, and Ping Li. 2022. Cross-Probe BERT for Fast Cross-Modal Search. In The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 11-15, 2022, Madrid, Spain. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3477495.3531826

1 INTRODUCTION
Inspired by the great success achieved by the self-attention mechanism of Transformer [32] and BERT [4] in NLP tasks, several text-vision BERT models [14, 17, 20, 37] have emerged.
They take the query-image pair as input and extend the original text-modality self-attention to multi-modal self-attention. Text-vision BERT effectively models the interactions between image features and query features, provides contextual encoding for both, and achieves excellent cross-modal retrieval.

SIGIR '22, July 11-15, 2022, Madrid, Spain. © 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-8732-3/22/07. https://doi.org/10.1145/3477495.3531826

In spite of the high effectiveness achieved by text-vision BERT, the extremely high computation cost brought by the pairwise input limits its practical usefulness, especially for large-scale cross-modal retrieval in industrial applications. Given a query and N reference images, it needs to feed N query-image pairs to the text-vision BERT for relevance scores. That is, it must repeatedly encode the query N times. In a large-scale cross-modal retrieval task, N is extremely large, making text-vision BERT prohibitively slow for obtaining relevance scores with all reference images. In contrast, a two-tower encoder only needs to encode the query once, and the reference image features can be pre-computed offline and cached in the database. Thus, it obtains relevance scores between the query and images in an efficient manner by computing similarities between the query feature and the pre-computed image features.
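The efficiency gap described above can be seen in a minimal NumPy sketch of the two-tower scoring path. The encoder here is a stand-in (a plain normalization step rather than a real text/image tower), and the cached image features are random placeholders; the point is that the query is encoded once and all N relevance scores reduce to a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Placeholder for a real text/image encoder; returns an L2-normalized vector.
    return x / np.linalg.norm(x)

# Offline: pre-compute and cache N reference image features (unit-normalized).
N, d = 10000, 128
image_feats = rng.standard_normal((N, d))
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)

# Online: encode the query ONCE, then score all N images with one matrix product.
query_feat = encode(rng.standard_normal(d))
scores = image_feats @ query_feat        # cosine similarities, shape (N,)
top10 = np.argsort(-scores)[:10]         # indices of the 10 best matches
```

A cross-modal BERT would instead require N full forward passes of the joint model, one per query-image pair, which is what makes it prohibitive at this scale.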
Though inefficient pairwise attention limits the usefulness of text-vision BERT in large-scale cross-modal retrieval, there are few works on speeding up text-vision BERT. In fact, the inefficiency caused by pairwise input is a general problem, also encountered in other retrieval tasks such as query-to-document retrieval [10] and question answering [2, 23]. In these tasks, there are similarly two mainstream encoders for obtaining the relevance score. The first type, the Bi-encoder [5], is based on the two-tower architecture. Since the query/question and the document are independently encoded, document features can be pre-computed and cached. In this case, the relevance between the query and each document is determined by the cosine similarity between the query/question's feature and the document's cached feature. It achieves high efficiency but relatively low retrieval accuracy. In contrast, the Cross-encoder [31] takes a question-answer pair as input, exploiting cross-attention like text-vision BERT and achieving high retrieval accuracy, but it is inefficient. To balance effectiveness and efficiency, existing methods [2, 23] adopt the two-tower architecture in the lower layers and the cross-attention architecture in the upper layers. We term this architecture the "split-merge" encoder. In this case, features from the lower two-tower layers are pre-computed and cached, and question-answer attention is conducted in the upper layers. Since the number of upper cross-attention layers is small, efficiency is boosted. Similarly, the Poly-Encoder [10] uses the two-tower architecture for feature extraction, with an additional cross-attention layer on top to obtain the similarity.

In this paper, we propose a novel architecture, cross-probe BERT (CP-BERT), for effective and efficient cross-modal retrieval. We devise several vision probes and text probes along with the image's local features and the query's word features.
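Before detailing our design, the split-merge pattern discussed above can be sketched in NumPy. This is an illustrative toy, not any paper's actual layer stack: the cached document-side features and the query-side features are random stand-ins for the outputs of lower two-tower layers, and the "merge" stage is a single head of scaled dot-product attention without learned projections.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    return softmax(q @ k.T / np.sqrt(d)) @ v

# "Split" stage: lower two-tower layers run independently, so the
# document-side features can be pre-computed offline and cached.
doc_tokens = rng.standard_normal((20, d))    # cached document-side features
query_tokens = rng.standard_normal((8, d))   # encoded once per query

# "Merge" stage: a small number of upper cross-attention layers mix the
# two sequences, recovering pairwise interactions at modest online cost.
joint = np.concatenate([query_tokens, doc_tokens], axis=0)
fused = attention(joint, joint, joint)       # one joint self-attention layer
score = float(fused.mean())                  # toy pooling into a relevance score
```

Because only the few upper layers run on the concatenated pair, the per-pair online cost stays far below that of a full cross-encoder.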
In the lower few layers, we adopt the two-tower architecture. The vision probes and the image's local features are concatenated and fed into the vision tower to generate the attended vision probes. In parallel, the text probes and the query's word features are concatenated and fed into