Steady-State Policy Synthesis in Multichain Markov Decision Processes

George Atia¹, Andre Beckus¹, Ismail Alkhouri¹ and Alvaro Velasquez²
¹Department of Electrical and Computer Engineering, University of Central Florida
²Information Directorate, Air Force Research Laboratory
george.atia@ucf.edu, {abeckus,ialkhouri}@knights.ucf.edu, alvaro.velasquez.1@us.af.mil

Abstract

The formal synthesis of automated or autonomous agents has elicited strong interest from the artificial intelligence community in recent years. This problem space broadly entails the derivation of decision-making policies for agents acting in an environment such that a formal specification of behavior is satisfied. Popular formalisms for such specifications include the quintessential Linear Temporal Logic (LTL) and Computation Tree Logic (CTL), which reason over infinite sequences and infinite trees of states, respectively. However, the related and relevant problem of reasoning over the frequencies with which states are visited in the limit, and of enforcing behavioral specifications on those frequencies, has received little attention. That problem, known as Steady-State Policy Synthesis (SSPS) or steady-state control, is the focus of this paper. Prior related work has been mostly confined to unichain Markov Decision Processes (MDPs), while a tractable solution to the general multichain setting has heretofore remained elusive. In this paper, we provide such a solution for multichain MDPs over a class of policies that account for all possible transitions in the given MDP. The solution policy is derived from a novel linear program (LP) that encodes constraints on the limiting distributions of the Markov chain induced by said policy. We establish a one-to-one correspondence between the feasible solutions of the LP and the stationary distributions of the induced Markov chains.
The derived policy is shown to maximize the reward among the constrained class of stationary policies and to satisfy the specification constraints even when it does not exercise all possible transitions.

1 Introduction

There has been a focus in recent years on the verification of autonomous systems by leveraging techniques used for decades in the model checking of software [Fisher et al., 2013]. While this verification step is crucial for the development of robust autonomous capabilities, a promising complementary approach is to design these capabilities such that the search for a correct design is driven by the same specifications used for verification. This methodology is often called correct-by-design construction [Haesaert et al., 2015] or formal/controller synthesis [Kress-Gazit et al., 2018]. Our contribution is in the same vein and entails the search for policies which satisfy constraints on the steady-state distribution of the resulting agent as it interacts with its environment for an indefinite period of time while following said policies. Progress in this area has interesting applications to problems where steady-state distributions are commonly used. These include the derivation of maintenance plans that minimize the asymptotic failure rate of various systems [Boussemart and Limnios, 2004; Boussemart et al., 2002], as well as constrained routing problems where average delay and packet loss metrics must be enforced [Lazar, 1983; Skwirzynski, 1981].

Steady-State Policy Synthesis (SSPS) is framed in the context of constrained Markov Decision Processes (MDPs) that model the agent-environment dynamics. This framework has long been studied in the stochastic dynamic control and operations research literature to handle multi-objective decision-making in the presence of uncertainty.
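To make the object being constrained concrete, the following minimal sketch computes the steady-state (stationary) distribution of the Markov chain induced by fixing a stationary policy in a toy MDP. The transition probabilities and the policy are illustrative assumptions chosen for this sketch, not an example taken from the paper.

```python
import numpy as np

def induced_chain(P, policy):
    """Transition matrix of the Markov chain induced by a stationary policy.

    P[a] is the |S| x |S| transition matrix of action a;
    policy[s, a] is the probability of taking action a in state s.
    """
    nA, nS, _ = P.shape
    return np.array([sum(policy[s, a] * P[a, s] for a in range(nA))
                     for s in range(nS)])

def stationary_distribution(M):
    """Solve pi M = pi with sum(pi) = 1 as an overdetermined linear system."""
    n = M.shape[0]
    A = np.vstack([M.T - np.eye(n), np.ones(n)])  # balance rows + normalization
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Two actions over two states (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.7, 0.3]]])  # action 1
policy = np.array([[1.0, 0.0],            # a deterministic policy:
                   [0.0, 1.0]])           # action 0 in state 0, action 1 in state 1
M = induced_chain(P, policy)
pi = stationary_distribution(M)
```

For this instance the induced chain is irreducible, so the limiting distribution is unique; in the multichain setting studied by the paper, the limiting distribution generally depends on the initial state, which is precisely what makes the synthesis problem harder.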
The pioneering work of Derman [Derman, 1970] and Altman [Altman, 1999] developed a constrained optimization framework for dynamic control problems based on linear programming, covering the discounted and total reward formulations as well as the expected average reward formulation. The vast majority of existing work, however, has focused on ergodic or unichain structures. This was pointed out recently by Altman in [Altman et al., 2019], where it is stated that "...the existing theory for solving such problems requires strong assumptions on the ergodic structure of the problem". Under such assumptions, average-reward constrained MDPs have been shown to admit efficient solutions owing to an established one-to-one correspondence between the optimal solutions of a formulated linear program (LP) and the optimal policies of the MDP. The notable work of Kallenberg [Kallenberg, 1983] laid the groundwork for Markovian control problems, their characterization in multichain settings, and the construction of optimal policies based on linear programming under several optimality criteria. However, the algorithms developed to construct an optimal policy for general multichain structures were shown to be computationally prohibitive for the expected average reward formulation.

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)
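For the unichain setting referenced in the prior work above, the LP-to-policy correspondence can be illustrated with the classical occupation-measure LP for average-reward MDPs: maximize expected reward over stationary state-action frequencies subject to balance and normalization constraints, then read the policy off the optimizer. The sketch below is an illustration on a toy MDP (all numbers are assumptions made for this example) using scipy.optimize.linprog; it is not the multichain LP proposed in the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative 2-state, 2-action unichain MDP (numbers are assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s']
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 2.0]])
nA, nS, _ = P.shape

# Decision variables x[s, a]: stationary state-action frequencies, flattened
# row-major so that variable index s * nA + a matches R.flatten().
c = -R.flatten()  # linprog minimizes, so negate the reward

# Balance constraints: sum_a x[s', a] = sum_{s, a} P[a, s, s'] x[s, a],
# plus one normalization row forcing the frequencies to sum to one.
A_eq = np.zeros((nS + 1, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = float(sp == s) - P[a, s, sp]
A_eq[nS, :] = 1.0
b_eq = np.append(np.zeros(nS), 1.0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(nS, nA)
policy = x / x.sum(axis=1, keepdims=True)  # recover the stationary policy
```

Under the unichain assumption every feasible x corresponds to a stationary policy and vice versa, which is the one-to-one correspondence the cited results exploit; in multichain MDPs this correspondence breaks down, motivating the LP developed in this paper.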