MGPolicy: Meta Graph Enhanced Off-policy Learning for Recommendations

Xiangmeng Wang (xiangmeng.wang@student.uts.edu.au), University of Technology Sydney, Sydney, Australia
Qian Li²∗ (qli@curtin.edu.au), Curtin University, Perth, Australia
Dianer Yu (Dianer.Yu-1@student.uts.edu.au), University of Technology Sydney, Sydney, Australia
Zhichao Wang (zchaoking@gmail.com), University of New South Wales, Sydney, Australia
Hongxu Chen (Hongxu.Chen@uts.edu.au), University of Technology Sydney, Sydney, Australia
Guandong Xu² (guandong.xu@uts.edu.au), University of Technology Sydney, Sydney, Australia

ABSTRACT
Off-policy learning has drawn huge attention in recommender systems (RS), as it offers reinforcement learning an opportunity to abandon expensive online training. However, off-policy learning from logged data suffers from biases caused by the policy shift between the target policy and the logging policy. Consequently, most off-policy learning resorts to inverse propensity scoring (IPS), which however tends to over-fit to exposed (or recommended) items and thus fails to explore unexposed items.

In this paper, we propose meta graph enhanced off-policy learning (MGPolicy), which is the first recommendation model to correct the off-policy bias via contextual information. In particular, we explicitly leverage the rich semantics in meta graphs for user state representation, and then train the candidate generation model to promote an efficient search in the action space. Moreover, MGPolicy is designed with counterfactual risk minimization, which can correct policy learning bias and ultimately yield an effective target policy that maximizes the long-run rewards for recommendation. We extensively evaluate our method through a series of simulations and large-scale real-world datasets, achieving favorable results compared with state-of-the-art methods. Our code is currently available online¹.
CCS CONCEPTS
· Theory of computation → Reinforcement learning; · Information systems → Recommender systems.

∗Contributing equally with the first author.
²Corresponding author.
¹https://www.dropbox.com/sh/9ugr1lx7gzwfub4/AABY46hVG6qKJnGAWjRJZMFKa?dl=0

SIGIR '22, July 11–15, 2022, Madrid, Spain
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-8732-3/22/07. . . $15.00
https://doi.org/10.1145/3477495.3532021

KEYWORDS
Recommendation; Off-policy Learning; Counterfactual Risk Minimization; Bias

ACM Reference Format:
Xiangmeng Wang, Qian Li, Dianer Yu, Zhichao Wang, Hongxu Chen, and Guandong Xu. 2022. MGPolicy: Meta Graph Enhanced Off-policy Learning for Recommendations. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), July 11–15, 2022, Madrid, Spain. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3477495.3532021

1 INTRODUCTION
Recommender systems (RS) have become prevalent in Web applications for helping users find preferred content amid the massive information provided [4]. Traditional RS, including collaborative filtering and knowledge-based systems [12], treat recommendation as a static process following a fixed greedy strategy, and thus cannot adapt to the sequential nature of user interaction with the system.
Recently, Reinforcement Learning (RL), which learns an optimal target recommendation policy to maximize long-term user satisfaction, has drawn huge attention in RS [2]. RL-based recommendation trains an agent (the recommender) via online learning from real-time user interaction trajectories. However, such online learning is infeasible in real RS, since it might harm user satisfaction and deteriorate the revenue of the platform [1]. Fortunately, off-policy learning emerges as a favorable opportunity for policy optimization, as it uses logged user feedback instead of expensive online interactive environments [15].

To abandon online training, as shown in Figure 1, off-policy learning needs to find an optimal target policy that maximizes users' long-term satisfaction, given logged data collected by the logging policy π₀. Thus, off-policy learning has to fundamentally address the counterfactual question: what would the cumulative reward (i.e., users' feedback during a period) have been if a new target policy had been deployed instead of the original logging policy [29]? Nevertheless, answering this counterfactual question from the logged feedback data is not easy, since the target policy differs from the historical logging policy in the off-policy setting [23, 25, 28]. As shown in Figure 1, the two policies hold different distributions, while rare actions chosen by the target policy
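As background (a minimal sketch, not this paper's implementation), the counterfactual question above is typically answered by IPS-style estimators, which reweight each logged reward by the ratio between the target policy's and the logging policy's probability of the logged action. All policy and reward values below are toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_items = 5
# Hypothetical policies: distributions over which item to recommend.
logging_policy = np.array([0.5, 0.2, 0.1, 0.1, 0.1])  # pi_0 (collected the logs)
target_policy = np.array([0.1, 0.1, 0.2, 0.3, 0.3])   # pi (to be evaluated)

# True per-item expected reward, unknown in practice; used only to simulate logs.
true_reward = np.array([0.1, 0.2, 0.3, 0.6, 0.8])

# Simulate logged feedback collected under the logging policy pi_0.
n_logs = 100_000
actions = rng.choice(n_items, size=n_logs, p=logging_policy)
rewards = rng.binomial(1, true_reward[actions])

# IPS estimate of the target policy's value from logged data:
# reweight each logged reward by pi(a) / pi_0(a).
weights = target_policy[actions] / logging_policy[actions]
ips_estimate = np.mean(weights * rewards)

# Ground-truth value of the target policy, for comparison only.
true_value = float(np.dot(target_policy, true_reward))
print(f"IPS estimate: {ips_estimate:.3f}, true value: {true_value:.3f}")
```

The estimator is unbiased under full support of π₀, but its variance grows with the weights π(a)/π₀(a): actions that are rare under the logging policy receive large weights, which is one source of the over-fitting to exposed items that the abstract describes.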