Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes

Junda Wu
jw6466@nyu.edu
New York University
New York City, New York, USA

Tong Yu
worktongyu@gmail.com
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA

Shuai Li 2
shuaili8@sjtu.edu.cn
Shanghai Jiao Tong University
Shanghai, Shanghai, China

ABSTRACT

In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in the systems can better guide users to provide feedback and thus improve the retrieval. However, developing explainable vision-language retrieval systems can be challenging, due to limited labeled multimodal data. In the retrieval of complex scenes, the issue of limited labeled data can be more severe. With multiple objects in the complex scenes, each user query may not exhaustively describe all objects in the desired image, and thus more labeled queries are needed. The issue of limited labeled data can cause data selection biases and result in spurious correlations learned by the models. When learning spurious correlations, existing explainable models may not be able to accurately extract regions from images and keywords from user queries.

In this paper, we discover that deconfounded learning is an important step to provide better vision-language explanations. Thus we propose a deconfounded explainable vision-language retrieval system. By introducing deconfounded learning to pretrain our vision-language model, the spurious correlations in the model can be reduced through interventions by potential confounders. This helps to train more accurate representations and further enables better explainability. Based on explainable retrieval results, we propose novel interactive mechanisms. In such interactions, users can better understand why the system returns particular results and give feedback that effectively improves the results.
This additional feedback is sample efficient and thus alleviates the data limitation problem. Through extensive experiments, our system achieves about 60% improvements, compared to the state-of-the-art.

CCS CONCEPTS

· Computing methodologies → Causal reasoning and diagnostics; · Information systems → Users and interactive retrieval; Retrieval efficiency; Presentation of retrieval results; Image search.

The work is done while the student is an intern at Shanghai Jiao Tong University.
2 Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM '21, October 20–24, 2021, Virtual Event, China
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8651-7/21/10...$15.00
https://doi.org/10.1145/3474085.3475366

KEYWORDS

image retrieval; vision and language; explainable machine learning; causal learning

ACM Reference Format:
Junda Wu, Tong Yu, and Shuai Li. 2021. Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475366

1 INTRODUCTION

In vision-language retrieval systems, users provide natural language feedback on visual features of items to express their preferences and find the desired items.
Recent advances in vision-language retrieval systems have led to a wide range of successful applications, such as fashion [4, 12, 17, 42–44], images of complex scenes [34], and movie posters and reviews [25]. Usually, these systems mainly focus on retrieval accuracy but fail to let users understand the results. To enable users to better understand the results and provide helpful feedback to improve the retrieval, it is highly desirable to develop explainable vision-language retrieval systems. To find the desired items more efficiently and accurately, many works have proposed explainable retrieval systems [11, 22, 24, 25].

However, compared to developing explainable systems for applications with a single modality, it is much more challenging to provide model explainability in vision-language retrieval, especially for complex scenes. One major challenge is that labeled data is limited. Compared to labeling single-modal data, labeling multimodal data requires much more human effort. Also, the data quality can be relatively low, since users can be very subjective when labeling utterances and it is difficult to validate the quality of the users' free-form utterances [17, 39]. In the retrieval of complex scenes, the issue of limited labeled data is more severe. To find complex image scenes with multiple objects, the systems usually need to interact with users for multiple rounds and ask users to describe different objects [34]. Due to the complexity of the objects and user queries, even more labeled data is required than for single-round retrieval of whole images [4, 12, 42].

The problem of limited labeled vision-language data can cause data selection biases and result in spurious correlations learned by the models [1, 28, 29, 38, 40, 45]. When training on limited labeled data, the biased labeled samples may lead the model to learn spurious correlations, which make less sense to users.
The spurious correlations learned by models can further make the generated vision-language explanations inaccurate and hard to understand.
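The intervention-based deconfounding described above is commonly formalized through the backdoor adjustment. As an illustrative sketch only (the symbols X for the query-image input, Y for the retrieval outcome, and z for a confounder are generic notation, not necessarily this paper's formulation):

```latex
% Correlation-based training implicitly estimates the likelihood,
% in which the confounder z enters through the biased term P(z \mid X):
P(Y \mid X) \;=\; \sum_{z} P(Y \mid X, z)\, P(z \mid X)

% Deconfounded learning instead approximates the intervention,
% stratifying over z with its prior P(z), which blocks the
% backdoor path X \leftarrow z \rightarrow Y:
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{z} P(Y \mid X, z)\, P(z)
```

Under this view, biased sample selection skews P(z | X), so a model trained on the plain likelihood absorbs spurious correlations; replacing it with the adjusted objective is what allows the learned representations, and hence the extracted regions and keywords, to be more faithful.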