Deconfounded and Explainable Interactive Vision-Language
Retrieval of Complex Scenes
Junda Wu∗
jw6466@nyu.edu
New York University
New York City, New York, USA
Tong Yu
worktongyu@gmail.com
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
Shuai Li2
shuaili8@sjtu.edu.cn
Shanghai Jiao Tong University
Shanghai, Shanghai, China
ABSTRACT
In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in the systems can better guide users to provide feedback and thus improve the retrieval. However, developing explainable vision-language retrieval systems can be challenging due to limited labeled multimodal data. In the retrieval of complex scenes, the issue of limited labeled data can be even more severe: with multiple objects in a complex scene, each user query may not exhaustively describe all objects in the desired image, so more labeled queries are needed. Limited labeled data can cause data selection biases and result in spurious correlations learned by the models. When models learn such spurious correlations, existing explainable models may not accurately extract regions from images and keywords from user queries.
In this paper, we discover that deconfounded learning is an important step toward better vision-language explanations. We therefore propose a deconfounded explainable vision-language retrieval system. By introducing deconfounded learning to pretrain our vision-language model, spurious correlations in the model can be reduced through interventions on potential confounders. This helps to train more accurate representations and further enables better explainability. Based on the explainable retrieval results, we propose novel interactive mechanisms. In such interactions, users can better understand why the system returns particular results and give feedback that effectively improves them. This additional feedback is sample efficient and thus alleviates the data limitation problem. Through extensive experiments, our system achieves about 60% improvement compared to the state of the art.
CCS CONCEPTS
· Computing methodologies → Causal reasoning and diagnostics; · Information systems → Users and interactive retrieval; Retrieval efficiency; Presentation of retrieval results; Image search.
∗This work was done while the author was an intern at Shanghai Jiao Tong University.
2Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM ’21, October 20–24, 2021, Virtual Event, China
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8651-7/21/10. . . $15.00
https://doi.org/10.1145/3474085.3475366
KEYWORDS
image retrieval; vision and language; explainable machine learning; causal learning
ACM Reference Format:
Junda Wu, Tong Yu, and Shuai Li. 2021. Deconfounded and Explainable Interactive Vision-Language Retrieval of Complex Scenes. In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3474085.3475366
1 INTRODUCTION
In vision-language retrieval systems, users provide natural language feedback on the visual features of items to express their preferences and find the desired items. Recent advances in vision-language retrieval systems have led to a wide range of successful applications, such as fashion [4, 12, 17, 42–44], images of complex scenes [34], and movie posters and reviews [25]. These systems usually focus on retrieval accuracy but fail to let users understand the results. To enable users to better understand the results and provide helpful feedback that improves the retrieval, it is highly desirable to develop explainable vision-language retrieval systems. To find the desired items more efficiently and accurately, many works have proposed explainable retrieval systems [11, 22, 24, 25].
However, compared to developing explainable systems for applications with a single modality, it is much more challenging to provide model explainability in vision-language retrieval, especially for complex scenes. One major challenge is that labeled data is limited. Compared to labeling single-modal data, labeling multimodal data requires much more human effort. The data quality can also be relatively low, since users can be very subjective when labeling utterances and it is difficult to validate the quality of free-form user utterances [17, 39]. In the retrieval of complex scenes, the issue of limited labeled data is more severe. To find complex image scenes with multiple objects, the systems usually need to interact with users over multiple rounds and ask them to describe different objects [34]. Due to the complexity of the objects and user queries, even more labeled data is required than in single-round retrieval of whole images [4, 12, 42]. Limited labeled vision-language data can cause data selection biases and result in spurious correlations learned by the models [1, 28, 29, 38, 40, 45]. When training on limited labeled data, biased labeled samples may lead the model to learn spurious correlations that make little sense to users. These spurious correlations can further make the generated vision-language explanations inaccurate and hard to understand.
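Intervening on confounders is commonly formalized via the backdoor adjustment; the following is a minimal sketch under the assumption of an observed confounder variable z ranging over a confounder set Z (the notation here is illustrative, not taken from the method described in this paper):

```latex
% Backdoor adjustment: interventional vs. observational estimates
P(Y \mid do(X)) = \sum_{z \in \mathcal{Z}} P(Y \mid X, z)\, P(z)
\qquad \text{vs.} \qquad
P(Y \mid X) = \sum_{z \in \mathcal{Z}} P(Y \mid X, z)\, P(z \mid X)
```

The observational estimate weights each stratum by P(z | X), which absorbs the data selection bias; the interventional estimate weights by the marginal P(z), cutting the dependence of the confounder on the input and thus reducing spurious correlations.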