Actor Based Simulation for Closed Loop Control of Supply Chain using Reinforcement Learning

Extended Abstract

Souvik Barat, Harshad Khadilkar, Hardik Meisheri, Vinay Kulkarni, Vinita Baniwal, Prashant Kumar, Monika Gajrani
Tata Consultancy Services Research, India
souvik.barat@tcs.com, harshad.khadilkar@tcs.com

ABSTRACT

Reinforcement Learning (RL) has achieved a degree of success in control applications such as online gameplay and robotics, but has rarely been used to manage operations of business-critical systems such as supply chains. A key aspect of using RL in the real world is to train the agent before deployment, so as to minimise experimentation in live operation. While this is feasible for online gameplay (where the rules of the game are known) and robotics (where the dynamics are predictable), it is much more difficult for complex systems due to associated complexities such as uncertainty, adaptability and emergent behaviour. In this paper, we describe a framework for effective integration of a reinforcement learning controller with an actor-based simulation of the complex networked system, in order to enable deployment of the RL agent in the real system with minimal further tuning.

KEYWORDS

Reinforcement learning; Simulation of complex systems; Model based simulation

ACM Reference Format:
Souvik Barat, Harshad Khadilkar, Hardik Meisheri, Vinay Kulkarni, Vinita Baniwal, Prashant Kumar, Monika Gajrani. 2019. Actor Based Simulation for Closed Loop Control of Supply Chain using Reinforcement Learning. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 3 pages.

1 INTRODUCTION

Business-critical systems need to continually make decisions to stay competitive and economically viable in a dynamic environment.
Reinforcement Learning (RL) [9, 11] is a class of machine learning algorithms that can be used for controlling such complex systems in an adaptive and flexible manner. The goal of the system controller (also called the RL agent) is to learn to take the best possible control actions in each possible state of the system, in order to maximise long-term system objectives. A crucial aspect of RL is the computation of the next state and associated rewards for the chosen action(s), in a closed loop that enables learning. The setup is illustrated in Figure 1. This paper argues that the use of analytical expressions for modelling the environment is infeasible for complex systems, and advocates an agent/actor based modelling abstraction [1, 8] as an effective modelling aid to understand the dynamics of such complex systems. We present a framework that uses RL for exploring policies and deciding control actions, and actor-based simulation for performing accurate long-term rollouts of the policies, in order to optimise the operation of complex systems. We use the domain of supply chain replenishment as a representative example.

Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), N. Agmon, M. E. Taylor, E. Elkind, M. Veloso (eds.), May 13–17, 2019, Montreal, Canada. © 2019 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

2 PROBLEM FORMULATION

We illustrate the generic reinforcement learning problem in the context of supply chain replenishment, which presents well-known difficulties for effective control [7, 10]. The scenario is that of a grocery retailer with a network of stores and warehouses, served by a fleet of trucks for transporting products.
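The closed-loop interaction of Figure 1 (agent observes state, chooses an action, receives a reward and the next state) can be sketched as follows. This is a minimal toy stand-in, not the paper's actor-based simulator: the single-inventory environment, the random placeholder policy, and the stock-out penalty are all illustrative assumptions.

```python
import random

class RandomAgent:
    """Placeholder agent: picks a random action. A learned policy
    (e.g. Q-learning) would map state -> action instead."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

    def learn(self, state, action, reward, next_state):
        pass  # an RL agent would update its policy estimate here

class ToyEnvironment:
    """Illustrative stand-in for the simulated environment: a single
    inventory level raised by the replenishment action and depleted
    by random demand."""
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.state = capacity

    def step(self, action):
        demand = random.randint(0, 3)
        self.state = min(self.capacity, self.state + action) - demand
        self.state = max(0, self.state)
        reward = 1.0 if self.state > 0 else -1.0  # penalise stock-out
        return self.state, reward

# The closed loop: state -> action -> (reward, next state) -> learning.
env = ToyEnvironment()
agent = RandomAgent(actions=[0, 1, 2, 3])
state = env.state
for t in range(5):
    action = agent.act(state)
    next_state, reward = env.step(action)
    agent.learn(state, action, reward, next_state)
    state = next_state
```

In the framework described here, `ToyEnvironment` would be replaced by the actor-based simulation, which plays the role of the environment block in Figure 1 during training.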
The goal of replenishment is to regulate the availability of the entire product range in each store at all times, subject to the spatio-temporal constraints imposed by available stocks, labour capacity, truck capacity, transportation times, and available shelf space for each product in each store. A schematic of the flow of products is shown in Figure 2.

From an operational perspective, each store stocks i ∈ {1, ..., k} unique varieties of products, each with a maximum shelf capacity c_{i,j}, where j ∈ {1, ..., n} is the index of the store. Further, let us denote by x_{i,j}(t) the inventory of product i in store j at time t. The replenishment quantities (actions) for delivery moment d are denoted by a_{i,j}(t_d), and are to be computed at time (t_d − Δ), where Δ is the lead time. The observation O(t_d − Δ) consists of the inventory of each product in each store at that time, the demand forecast for each product between the next two delivery moments, and metadata such as unit volume, weight, and shelf life. The inventory x_{i,j}(t) depletes between two delivery moments (d − 1) and d, and undergoes a step increase by amount a_{i,j}(t_d) at time t_d. The reward r(t_{d−1}) is a function of the actions a_{i,j}(t_{d−1}) and the inventory x_{i,j}(t) in t ∈ [t_{d−1}, t_d). Two quantities are of particular interest: (i) the number of products that remain available throughout the time interval [t_{d−1}, t_d), and (ii) the wastage of any products.

[Figure 1: Interaction of RL agent with an environment.]
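The inventory dynamics and the two reward quantities above can be sketched on a (k products) × (n stores) grid. The depletion model (forecast demand applied over the whole interval), the overflow-based wastage proxy, and the reward weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

k, n = 3, 2                          # products i, stores j
rng = np.random.default_rng(0)
c = np.full((k, n), 50.0)            # shelf capacities c_{i,j}
x = rng.uniform(10, 40, (k, n))      # inventories x_{i,j}(t_{d-1})
a = rng.uniform(0, 20, (k, n))       # replenishment actions a_{i,j}(t_d)
demand = rng.uniform(5, 25, (k, n))  # demand over [t_{d-1}, t_d)

# Deplete inventory over [t_{d-1}, t_d), then apply the step increase
# at delivery moment t_d, clipped to shelf capacity.
x_end = np.maximum(x - demand, 0.0)
x_next = np.minimum(x_end + a, c)

# Reward components: (i) products still available at the end of the
# interval, (ii) wastage, here proxied by replenishment overflowing
# the shelf (an assumption; shelf-life expiry is another source).
available = int((x_end > 0).sum())
wastage = float(np.maximum(x_end + a - c, 0.0).sum())
reward = available - 0.1 * wastage   # illustrative weighting
print(available, round(wastage, 2), round(reward, 2))
```

A trained agent would choose a_{i,j}(t_d) at time (t_d − Δ) from the observation O(t_d − Δ) so as to maximise the long-run sum of such rewards.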