RLS Algorithms and Convergence Analysis Method for Online DLQR Control Design via Heuristic Dynamic Programming

Watson R. M. Santos, Jonathan A. Queiroz, João Viana da F. Neto, Patrícia H. M. Rêgo, Ewaldo Santana and Gustavo Andrade
Federal University of Maranhão, Federal Institute of Maranhão, State University of Maranhão
Embedded Systems and Intelligent Control Laboratory
São Luís - Maranhão - Brazil
e-mail: watson.itz@ifma.edu.br, jviana@dee.ufma.br

Abstract — In this paper, a method to design online optimal policies that combines approximation of the Hamilton-Jacobi-Bellman (HJB) equation solution with the heuristic dynamic programming (HDP) approach is proposed. Recursive least squares (RLS) algorithms are developed to approximate the HJB equation solution, supported by a sequence of greedy policies. The proposal investigates the convergence properties of a family of RLS algorithms and their numerical complexity in the context of reinforcement learning and optimal control. The algorithms are computationally evaluated on an electric circuit model that represents a MIMO dynamic system. The results presented herein emphasize the convergence behaviour of the RLS, projection and Kaczmarz algorithms that are developed for online applications.

Keywords — Recursive Least Squares; Heuristic Dynamic Programming; RLS Convergence; MIMO Dynamic Systems; Optimal Control; Adaptive Dynamic Programming.

I. INTRODUCTION

In order to overcome the curse-of-dimensionality problem in applications of the dynamic programming approach, such as the online design of optimal control [1] [2] [3], much effort has been spent on developing methods to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation [4] [1]. Recently, approximate dynamic programming (ADP) methods and algorithms have been proposed to improve numerical stability [5] and to increase the convergence speed of the RLS algorithm family [6] [7].
In this article, a proposal to reduce the computational complexity of the approximate solution of the HJB equation underlying the discrete linear quadratic regulator (DLQR) problem via derivatives of the recursive least squares (RLS) method, such as the Kaczmarz and projection algorithms, is presented. The online DLQR optimal control design based on the heuristic dynamic programming (HDP) scheme is the general context of the main contributions presented in this paper, which include RLS algorithms together with a convergence analysis method. The RLS algorithms are developed to approximate the Hamilton-Jacobi-Bellman (HJB) equation solution, and the convergence analysis provides a general procedure to select the parameters of these algorithms. Specifically, an approximation method based on the RLS approach to solve the discrete algebraic Riccati equation (DARE), which is a particular form of the HJB equation for the DLQR, is presented. The convergence analysis method for the state and value functions of heuristic dynamic programming algorithms is based on the non-singularity of the Kronecker transformation of the state regressor matrix of the RLS estimators. From our point of view, online control design means that the controller gains are automatically adjusted while the dynamic process is in operation. The present proposal lies in the context of reinforcement learning (RL), approximate dynamic programming (ADP) and policy iteration, and develops online algorithms for the DLQR solution. These algorithms are based on the RLS method to approximate the Hamilton-Jacobi-Bellman equation solution of the DLQR parameterizations. The solutions are given for the discrete algebraic Riccati equation and for the optimal gain in the state value function approximation, respectively.
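To make the scheme concrete, the following Python sketch runs HDP iterations for a DLQR problem: the quadratic value function V(x) = xᵀPx is fitted by RLS over the Kronecker regressor kron(x, x), and the policy is improved greedily from the estimated P at each iteration. This is a minimal illustration under assumed conditions, not the authors' implementation; the second-order system below is a placeholder, not the paper's circuit model.

```python
import numpy as np

def rls_update(theta, Pcov, phi, d, lam=1.0):
    """One recursive least-squares step for the scalar equation d ≈ phi @ theta."""
    Pphi = Pcov @ phi
    g = Pphi / (lam + phi @ Pphi)           # RLS gain vector
    theta = theta + g * (d - phi @ theta)   # correct parameters by the residual
    Pcov = (Pcov - np.outer(g, Pphi)) / lam
    return theta, Pcov

# Placeholder 2nd-order stable system (not the paper's circuit model)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

n = A.shape[0]
theta = np.zeros(n * n)        # estimate of vec(P)
K = np.zeros((1, n))           # initial policy gain
rng = np.random.default_rng(0)

for j in range(50):            # HDP iterations
    Pcov = 1e3 * np.eye(n * n)
    Pj = theta.reshape(n, n)   # value matrix frozen from the previous iteration
    for _ in range(40):        # sampled states exciting the regressor
        x = rng.standard_normal(n)
        u = -K @ x
        xn = A @ x + B @ u
        r = x @ Q @ x + u @ R @ u
        d = r + xn @ Pj @ xn   # HDP target built with the frozen P_j
        phi = np.kron(x, x)    # quadratic (Kronecker) state regressor
        theta, Pcov = rls_update(theta, Pcov, phi, d)
    P = theta.reshape(n, n)
    P = 0.5 * (P + P.T)        # enforce symmetry of the value matrix
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # greedy policy improvement
```

The inner loop realizes the critic (value update) step, and the last line the actor (policy improvement) step; in the paper these steps alternate online along the measured state trajectory rather than over randomly drawn states.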
The reinforcement learning agent (controller) and the environment (process) are associated through ADP, as shown in Figure 1, which represents the control system with classical feedback in the context of adaptive control performed by the critic and actor blocks. Specifically, the RL approach is based on minimizing the Bellman error, and heuristic dynamic programming is carried out by a scheme that combines RLS value function approximation with policy improvement. This approach is directed at the online solution of optimal control problems in the sense that policy improvements are performed at every time step along the realization of the state trajectory towards the optimal policy.

Figure 1. Actor-Critic Schematic in a Control System

2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, 978-1-4799-4923-6/14 $31.00 © 2014 IEEE, DOI 10.1109/UKSim.2014.109
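The reduced-complexity RLS derivatives discussed above admit very compact single-equation updates, well suited to corrections at every time step. The sketch below contrasts the Kaczmarz and projection updates on a generic streaming regression d = φᵀθ, which is the same algebraic form as the Bellman-error equations solved by the critic; the toy problem and parameter values (step size α, regularizer c) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kaczmarz_step(theta, phi, d):
    """Kaczmarz update: orthogonal projection of theta onto the
    hyperplane {theta : phi @ theta = d}; no covariance matrix is stored."""
    return theta + phi * (d - phi @ theta) / (phi @ phi)

def projection_step(theta, phi, d, alpha=0.5, c=1e-3):
    """Projection (normalized-gradient) update; c > 0 guards against
    division by zero and alpha in (0, 2) controls the step size."""
    return theta + alpha * phi * (d - phi @ theta) / (c + phi @ phi)

# Recover a fixed parameter vector from a stream of scalar equations
rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0, 0.5])
th_k = np.zeros(3)   # Kaczmarz estimate
th_p = np.zeros(3)   # projection-algorithm estimate
for _ in range(2000):
    phi = rng.standard_normal(3)
    d = phi @ theta_true
    th_k = kaczmarz_step(th_k, phi, d)
    th_p = projection_step(th_p, phi, d)
```

Both updates cost O(n) per sample instead of the O(n²) of full RLS, which is the complexity reduction the proposal targets; the price is a slower, step-size-dependent convergence rate, which is what the paper's convergence analysis is meant to characterize.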