5244 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 29, 2020
Attribute-Guided Attention for Referring Expression
Generation and Comprehension
Jingyu Liu, Wei Wang, Liang Wang, Fellow, IEEE, and Ming-Hsuan Yang, Fellow, IEEE
Abstract— A referring expression is a verbal expression whose goal is to refer to a particular object in a scene. Referring expression generation and comprehension are two inverse tasks in this field. Considering the critical role that visual attributes play in distinguishing the referred object from other objects, we propose an attribute-guided attention model to address both tasks. In our framework, attributes collected from referring expressions are used as explicit supervision signals for the generation and comprehension modules. The attributes predicted online for a visual object benefit both tasks in two aspects. First, attributes can be directly embedded into the generation and comprehension modules as additional visual representations that distinguish the referred object. Second, since attributes have correspondences in both the visual and textual spaces, an attribute-guided attention module is proposed as a bridge linking counterparts in the visual representation and the textual expression. The attention weights learned on both visual features and word embeddings validate our motivation. We experiment on three standard datasets commonly used in this field: RefCOCO, RefCOCO+, and RefCOCOg. Both quantitative and qualitative results demonstrate the effectiveness of the proposed framework. The experimental results show significant improvements over baseline methods and compare favorably against the state of the art. A further ablation study and analysis clearly demonstrate the contribution of each module, which may provide useful insights to the community.
Index Terms— Referring expression, generation, comprehension, attributes, attribute-guided attention.
I. INTRODUCTION

A REFERRING expression is often a noun phrase that identifies an object in a discourse. It is frequently used in our daily conversations when a speaker needs to refer to or indicate a
Manuscript received April 27, 2018; revised March 8, 2019 and
December 11, 2019; accepted February 24, 2020. Date of publication
March 12, 2020; date of current version March 26, 2020. This work was
supported in part by the Major Project for New Generation of AI under Grant
2018AAA0100402, in part by the National Key Research and Development
Program of China under Grant 2016YFB1001000, in part by the National
Natural Science Foundation of China under Grant 61525306, Grant 61633021,
Grant 61721004, Grant 61420106015, Grant 61806194, and Grant U1803261,
in part by the Capital Science and Technology Leading Talent Training Project
under Grant Z181100006318030, HW2019SOW01, and in part by CAS-
AIR. The associate editor coordinating the review of this manuscript and
approving it for publication was Prof. Sos S. Agaian. (Corresponding author:
Jingyu Liu.)
Jingyu Liu is with the School of Electronics Engineering and
Computer Science, Peking University, Beijing 100871, China (e-mail:
jingyu.liu@pku.edu.cn).
Wei Wang and Liang Wang are with the National Laboratory of Pattern
Recognition (NLPR), Center for Research on Intelligent Perception and
Computing (CRIPAC), Institute of Automation, Chinese Academy of Sciences
(CASIA), Beijing 100190, China, and also with the Chinese Academy of
Sciences Artificial Intelligence Research (CAS-AIR), Beijing 100190, China
(e-mail: wangwei@nlpr.ia.ac.cn; wangliang@nlpr.ia.ac.cn).
Ming-Hsuan Yang is with the School of Engineering, University of Califor-
nia at Merced, Merced, CA 95344 USA (e-mail: mhyang@ucmerced.edu).
Digital Object Identifier 10.1109/TIP.2020.2979010
Fig. 1. Referring expression in everyday life to identify an object. The green box and blue boxes denote the referred object and the other objects, respectively. Both attributes, “closer” and “red”, are needed to make the target unambiguous.
particular object to a listener. Imagine a dialogue between two viewers in front of a crowd of people, as in Figure 1. The speaker can use the expression “The closer boy in red” to refer to the target, and the listener can then successfully comprehend which person is referred to via the attributes “closer” and “red”. Note that omitting either attribute would make the expression ambiguous.
As tasks in computer vision, referring expression generation and comprehension are mutually inverse. The generation task requires a model to produce an unambiguous expression for a target object in an image. Conversely, the comprehension task requires a model to understand a received expression and localize the referred object in the image. Figure 2 illustrates referring expression comprehension and generation in its two rows, respectively, where the green and blue boxes denote ground-truth and comprehended objects.
Referring expression comprehension is a more recent task, which outputs the location of an object given an expression. Practical approaches often accomplish this task in two steps: first, generate a set of candidate objects with object detectors; second, pick the referred object from the candidates. Recent approaches focus on designing a ranking-based strategy to retrieve the referred object in the second step, and mainly formulate it in two ways. The first formulation addresses the problem as the inverse of generation: a generation model provides the probability P(r|o) of a referring expression r given an object o, and by Bayes’ rule, P(o|r) can be obtained from P(r|o) for a given r. The second formulation addresses the problem as image/text retrieval: the visual and textual representations of the target object are embedded into a common space, and then a distance metric is
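The Bayesian ranking formulation above can be sketched as follows. This is an illustrative toy, not the paper’s model: the candidate representation and the scoring function `generation_log_prob` are hypothetical stand-ins for a trained generation model that scores log P(r|o), and a uniform prior P(o) is assumed so that ranking by the generation likelihood alone suffices.

```python
import math

def generation_log_prob(expression, candidate):
    # Stand-in for a trained generation model scoring log P(r | o).
    # Toy rule: each word of the expression that matches one of the
    # candidate's attributes is likely (0.9), otherwise unlikely (0.1).
    return sum(math.log(0.9) if w in candidate["attributes"] else math.log(0.1)
               for w in expression.split())

def comprehend(expression, candidates):
    # Bayes' rule with a uniform prior P(o): P(o | r) is proportional
    # to P(r | o), so the highest-likelihood candidate is the answer.
    return max(candidates, key=lambda o: generation_log_prob(expression, o))

candidates = [
    {"id": "boy_front", "attributes": {"closer", "red", "boy"}},
    {"id": "boy_back",  "attributes": {"boy", "blue"}},
]
print(comprehend("closer boy red", candidates)["id"])  # boy_front
```

In a real system, the toy scorer would be replaced by an LSTM- or Transformer-based captioning model evaluated on each detected candidate region.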
1057-7149 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.