MGPolicy: Meta Graph Enhanced Off-policy Learning for
Recommendations
Xiangmeng Wang
xiangmeng.wang@student.uts.edu.au
University of Technology Sydney
Sydney, Australia
Qian Li²∗
qli@curtin.edu.au
Curtin University
Perth, Australia
Dianer Yu
Dianer.Yu-1@student.uts.edu.au
University of Technology Sydney
Sydney, Australia
Zhichao Wang
zchaoking@gmail.com
University of New South Wales
Sydney, Australia
Hongxu Chen
Hongxu.Chen@uts.edu.au
University of Technology Sydney
Sydney, Australia
Guandong Xu²
guandong.xu@uts.edu.au
University of Technology Sydney
Sydney, Australia
ABSTRACT
Off-policy learning has drawn considerable attention in recommender
systems (RS), as it allows reinforcement learning to abandon expensive
online training. However, off-policy learning from logged data suffers
from biases caused by the policy shift between the target policy and
the logging policy. Consequently, most off-policy learning methods
resort to inverse propensity scoring (IPS), which, however, tends to
over-fit to exposed (or recommended) items and thus fails to explore
unexposed items.
In this paper, we propose meta graph enhanced off-policy learning
(MGPolicy), the first recommendation model that corrects the
off-policy bias via contextual information. In particular, we explicitly
leverage the rich semantics in meta graphs for user state
representation, and then train the candidate generation model to
promote an efficient search in the action space. Moreover, MGPolicy
is designed with counterfactual risk minimization, which corrects the
policy learning bias and ultimately yields an effective target policy
that maximizes long-run rewards for recommendation. We extensively
evaluate our method through a series of simulations and on large-scale
real-world datasets, achieving favorable results compared with
state-of-the-art methods. Our code is currently available online¹.
CCS CONCEPTS
• Theory of computation → Reinforcement learning; • Information
systems → Recommender systems.
∗Contributing equally with the first author.
²Corresponding author.
¹https://www.dropbox.com/sh/9ugr1lx7gzwfub4/AABY46hVG6qKJnGAWjRJZMFKa?dl=0
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
SIGIR '22, July 11–15, 2022, Madrid, Spain
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-8732-3/22/07…$15.00
https://doi.org/10.1145/3477495.3532021
KEYWORDS
Recommendation; Off-policy Learning; Counterfactual Risk
Minimization; Bias
ACM Reference Format:
Xiangmeng Wang, Qian Li, Dianer Yu, Zhichao Wang, Hongxu Chen, and Guan-
dong Xu. 2022. MGPolicy: Meta Graph Enhanced Off-policy Learning for
Recommendations. In Proceedings of the 45th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR
'22), July 11–15, 2022, Madrid, Spain. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3477495.3532021
1 INTRODUCTION
Recommender systems (RS) have become prevalent in Web applications,
helping users find preferred content amid massive information [4].
Traditional RS, including collaborative filtering and knowledge-based
systems [12], treat recommendation as a static process following a
fixed greedy strategy, and thus cannot adapt to the sequential nature
of user interaction with the system. Recently, Reinforcement Learning
(RL), which learns an optimal target recommendation policy to maximize
long-term user satisfaction, has drawn considerable attention in
RS [2]. RL-based recommendation trains an agent (the recommender) via
online learning from real-time user interaction trajectories. However,
such online learning is infeasible in real RS, since it might harm
user satisfaction and deteriorate the revenue of the platform [1].
Fortunately, off-policy learning emerges as a favorable alternative
for policy optimization, using logged user feedback instead of
expensive online interactive environments [15].
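To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's method) of the standard IPS estimator that off-policy learning builds on: logged rewards are reweighted by the ratio of target-policy to logging-policy action probabilities. All names and the toy data below are our own for illustration.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Inverse propensity scoring (IPS) estimate of the target
    policy's expected reward, computed from logged feedback.

    rewards[i]       : observed reward for the i-th logged action
    target_probs[i]  : probability the target policy assigns to that action
    logging_probs[i] : probability the logging policy assigned to it
    """
    weights = target_probs / logging_probs  # importance weights
    return float(np.mean(weights * rewards))

# Toy logged data: a uniform logging policy, while the target policy
# up-weights the actions that happened to receive reward, so its IPS
# estimate exceeds the naive average (0.5) of the logged rewards.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
logging_probs = np.array([0.5, 0.5, 0.5, 0.5])
target_probs = np.array([0.8, 0.2, 0.8, 0.2])
print(ips_estimate(rewards, target_probs, logging_probs))  # 0.8
```

The estimator is unbiased when the logging probabilities are known and nonzero for every logged action, but its variance grows as the two policies diverge; this variance, and the over-fitting to exposed items noted in the abstract, are what motivate the corrections studied in this paper.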
To abandon online training, as shown in Figure 1, off-policy learning
needs to find an optimal target policy π that maximizes users'
long-term satisfaction from logged data collected by a logging policy
π₀. Thus, off-policy learning has to fundamentally address a
counterfactual question: what would the cumulative reward (i.e.,
users' feedback over a period) have been if a new target policy had
been deployed instead of the original logging policy [29]? Nevertheless,
answering this counterfactual question from the logged feedback data is
not easy, since the target policy differs from the historical logging
policy in the off-policy setting [23, 25, 28]. As shown in Figure 1,
the two policies hold different distributions, while rare actions
chosen by the target policy