Actor Based Simulation for Closed Loop Control of Supply
Chain using Reinforcement Learning
Extended Abstract
Souvik Barat, Harshad Khadilkar, Hardik Meisheri, Vinay Kulkarni, Vinita Baniwal,
Prashant Kumar, Monika Gajrani
Tata Consultancy Services Research, India
souvik.barat@tcs.com, harshad.khadilkar@tcs.com
ABSTRACT
Reinforcement Learning (RL) has achieved a degree of success in control applications such as online gameplay and robotics, but has rarely been used to manage operations of business-critical systems such as supply chains. A key aspect of using RL in the real world is to train the agent before deployment, so as to minimise experimentation during live operation. While this is feasible for online gameplay (where the rules of the game are known) and robotics (where the dynamics are predictable), it is much more difficult for complex systems, which exhibit characteristics such as uncertainty, adaptability, and emergent behaviour. In this paper, we describe a framework for effective integration of a reinforcement learning controller with an actor-based simulation of a complex networked system, in order to enable deployment of the RL agent in the real system with minimal further tuning.
KEYWORDS
Reinforcement learning; Simulation of complex systems; Model
based simulation
ACM Reference Format:
Souvik Barat, Harshad Khadilkar, Hardik Meisheri, Vinay Kulkarni, Vinita
Baniwal, Prashant Kumar, Monika Gajrani. 2019. Actor Based Simulation for
Closed Loop Control of Supply Chain using Reinforcement Learning. In Proc.
of the 18th International Conference on Autonomous Agents and Multiagent
Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS,
3 pages.
1 INTRODUCTION
Business-critical systems need to continually make decisions to
stay competitive and economically viable in a dynamic environment. Reinforcement Learning (RL) [9, 11] is a class of machine learning algorithms that can be used for controlling such complex systems in an adaptive and flexible manner. The goal of the system controller (also called the RL agent) is to learn to take the best possible control actions in each possible state of the system, in order to maximise long-term system objectives. A crucial aspect of RL is the computation of the next state and associated rewards for the chosen action(s), in a closed loop that enables learning. The setup is illustrated in Figure 1. This paper argues that the use of analytical expressions
for modelling the environment is infeasible for complex systems,
and advocates an agent/actor-based modelling abstraction [1, 8] as an effective modelling aid for understanding the dynamics of such complex systems. We present a framework that uses RL for exploring policies and deciding control actions, and actor-based simulation for performing accurate long-term rollouts of those policies, in order to optimise the operation of complex systems. We use the domain of supply chain replenishment as a representative example.

Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), N. Agmon, M. E. Taylor, E. Elkind, M. Veloso (eds.), May 13–17, 2019, Montreal, Canada. © 2019 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
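The closed-loop agent–environment interaction described above can be sketched with a toy single-product inventory environment. This is a minimal illustrative sketch, not the paper's actual implementation: the class `ToyReplenishmentEnv`, the demand distribution, and the order-up-to policy are all assumptions made for the example.

```python
import random

class ToyReplenishmentEnv:
    """One store, one product: the state is the current inventory level."""
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.inventory = capacity // 2

    def step(self, action):
        # Replenish (bounded by shelf capacity), then serve random demand.
        self.inventory = min(self.capacity, self.inventory + action)
        demand = random.randint(0, 4)
        sold = min(self.inventory, demand)
        self.inventory -= sold
        # Reward sales, penalise unmet demand (lost sales).
        reward = sold - (demand - sold)
        return self.inventory, reward

def run_episode(env, policy, steps=50):
    state, total_reward = env.inventory, 0.0
    for _ in range(steps):
        action = policy(state)            # a(t) chosen from s(t)
        state, reward = env.step(action)  # environment returns s(t+1), r(t)
        total_reward += reward
    return total_reward

random.seed(0)
env = ToyReplenishmentEnv()
# Naive order-up-to policy: refill the shelf to capacity every period.
print(run_episode(env, policy=lambda s: env.capacity - s))
```

In the framework proposed here, the hand-written `step` above is what the actor-based simulation replaces: the next state and reward come from simulating the networked system rather than from a closed-form model.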
2 PROBLEM FORMULATION
We illustrate the generic reinforcement learning problem in the
context of supply chain replenishment, which presents well-known
difficulties for effective control [7, 10]. The scenario is that of a grocery retailer with a network of stores and warehouses served by a fleet of trucks for transporting products. The goal of replenishment is to regulate the availability of the entire product range in each store at all times, subject to the spatio-temporal constraints imposed by available stocks, labour capacity, truck capacity, transportation times, and available shelf space for each product in each store. A schematic of the flow of products is shown in Figure 2.
From an operational perspective, each store stocks $k$ unique varieties of products, indexed by $i \in \{1, \ldots, k\}$, each with a maximum shelf capacity $c_{i,j}$, where $j \le n$ is the index of the store. Further, let us denote by $x_{i,j}(t)$ the inventory of product $i$ in store $j$ at time $t$. The replenishment quantities (actions) for delivery moment $d$ are denoted by $a_{i,j}(t_d)$, and are to be computed at time $(t_d - \Delta)$, where $\Delta$ is the lead time. The observation $O(t_d - \Delta)$ consists of the inventory of each product in each store at that time, the demand forecast for each product between the next two delivery moments, and metadata such as unit volume, weight, and shelf life. The inventory $x_{i,j}(t)$ depletes between two delivery moments $(d-1)$ and $d$, and undergoes a step increase by amount $a_{i,j}(t_d)$ at time $t_d$.
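The inventory dynamics just described can be sketched in a few lines of NumPy. The dimensions, capacities, and the random demand trace standing in for the forecast are illustrative assumptions; only the update structure (capacity-truncated step increase at $t_d$, depletion until $t_{d+1}$) follows the formulation above.

```python
import numpy as np

# Assumed dimensions for illustration: k products, n stores.
k, n = 3, 2
rng = np.random.default_rng(42)

c = np.full((k, n), 20.0)            # shelf capacities c_{i,j}
x = rng.uniform(5, 15, size=(k, n))  # inventories x_{i,j}(t) just before t_d
a = rng.uniform(0, 10, size=(k, n))  # replenishment actions a_{i,j}(t_d)

# Step increase at t_d, truncated so no shelf exceeds its capacity.
x_after = np.minimum(x + a, c)

# Between t_d and t_{d+1}, inventory depletes with realised demand
# (here a random stand-in for the forecasted demand).
demand = rng.uniform(0, 8, size=(k, n))
x_next = np.maximum(x_after - demand, 0.0)
```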
The reward $r(t_{d-1})$ is a function of the actions $a_{i,j}(t_{d-1})$ and the inventory $x_{i,j}(t)$ for $t \in [t_{d-1}, t_d)$. Two quantities are of particular interest: (i) the number of products that remain available throughout the time interval $[t_{d-1}, t_d)$, and (ii) the wastage of any products
Figure 1: Interaction of the RL agent with an environment (the agent observes state $s(t)$ and reward $r(t-1)$, and applies actions $a(t)$).
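The two reward ingredients identified above, availability over the interval and wastage, can be combined into a per-interval reward as sketched below. The trade-off weight, the wastage rule, and the random stand-ins for a simulated trajectory are assumptions for illustration, not the paper's actual reward definition.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 2  # assumed dimensions: k products, n stores

# Minimum inventory of each product in each store over [t_{d-1}, t_d)
# (a random stand-in for values produced by a simulated trajectory).
min_inventory = rng.uniform(0, 5, size=(k, n))
wasted_units = rng.uniform(0, 2, size=(k, n))  # e.g. units past shelf life

availability = np.count_nonzero(min_inventory > 0)  # (i) never stocked out
wastage_penalty = wasted_units.sum()                # (ii) total wasted units

w = 1.0  # assumed trade-off weight between the two terms
reward = availability - w * wastage_penalty
```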