Reinforcement Learning Solution with Costate Approximation for a Flexible Wing Aircraft

Mohammed Abouheaf
School of Electrical Engineering and Computer Science
University of Ottawa
Ottawa, Ontario, Canada
Email: mohammed.abouheaf@uottawa.ca

Wail Gueaieb
School of Electrical Engineering and Computer Science
University of Ottawa
Ottawa, Ontario, Canada
Email: wail.gueaieb@uottawa.ca

Abstract—An online adaptive learning approach based on costate function approximation is developed to solve an optimal control problem in real time. The proposed approach tackles the main concerns associated with classical Dual Heuristic Dynamic Programming techniques in uncertain dynamical environments. It employs a policy iteration paradigm along with adaptive critics to implement the adaptive learning solution. The resulting framework does not require prior knowledge of the system dynamics, which makes it suitable for systems with high modeling uncertainties. As a proof of concept, the proposed structure is applied to the autopilot control of a flexible wing aircraft with unknown dynamics that vary continuously with the trim speed condition. Numerical simulations show that the adaptive control technique is able to learn the system's dynamics and regulate its states as desired in a relatively short time.

I. INTRODUCTION

Dual Heuristic Dynamic Programming approaches employ costate function approximations to solve Dynamic Programming problems [1]–[4]. A key challenge associated with these approaches is the need to know the drift dynamics of the considered systems. This work introduces an online reinforcement learning solution based on costate function approximation. The approach does not need the drift dynamics of the system, making it suitable for controlling systems that cannot be accurately modeled or identified.
It is used to solve the challenging control problem of the flexible wing aircraft, where the aerodynamic model of the flexible wing varies continuously (i.e., the drift dynamics are hard to model).

Optimal control problems can be formulated as decision processes in the framework of Artificial Intelligence. Dynamic Programming approaches are used to solve optimal control problems in [1]–[5]. Approximate Dynamic Programming (ADP) approaches find approximate solutions to Dynamic Programming problems in [2], [5], [6]. These approaches combine knowledge from Dynamic Programming, Reinforcement Learning (RL), and Adaptive Critics [2]–[8]. ADP approaches are used in cooperative control, computational intelligence, decision making, and applied mathematics [9], [10]. ADP techniques are developed to solve optimal control problems online in [2] and offline in [11]. Reinforcement Learning approaches use different learning processes in dynamic environments [12]. These involve two-step techniques known as Value Iteration (VI) and Policy Iteration (PI) [6]. Reinforcement Learning solutions are implemented using actor-critic neural networks [6]. Actor-critic neural networks are forms of Temporal Difference (TD) methods with separate learning structures [13]. The actor approximates the optimal decision and applies it to the dynamic environment; the quality of this decision is then assessed and approximated using a critic structure [6]. Following this assessment, the actor and critic weights are updated [12], [13]. Adaptive critics are used to solve the optimal control problem in real time in [14]. Solving the optimal control problem yields the necessary optimality conditions and hence the optimal strategies [15]. Reinforcement Learning is used to solve the synchronization control problem online in [16]–[18].
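The two-step Policy Iteration cycle mentioned above (policy evaluation by a critic, followed by greedy policy improvement by an actor) can be illustrated on a toy problem. The sketch below is not the paper's continuous-time costate method; it is a minimal discrete Policy Iteration example on a hypothetical 2-state, 2-action MDP, whose transition model P and reward R are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only).
P = np.array([                    # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],     # action 0
    [[0.1, 0.9], [0.8, 0.2]],     # action 1
])
R = np.array([                    # R[a, s] = immediate reward
    [1.0, 0.0],
    [0.0, 1.0],
])
gamma = 0.9                       # discount factor

policy = np.zeros(2, dtype=int)   # arbitrary initial policy
for _ in range(50):
    # Policy evaluation ("critic" step): solve V = R_pi + gamma * P_pi V
    P_pi = P[policy, np.arange(2)]            # per-state transition rows
    R_pi = R[policy, np.arange(2)]            # per-state rewards
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement ("actor" step): act greedily w.r.t. the evaluation
    Q = R + gamma * P @ V                     # Q[a, s]
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy):    # converged to a fixed point
        break
    policy = new_policy

print(policy)  # converged greedy policy, one action per state
```

Each pass alternates exactly the two phases described in the text: the critic assesses the current policy's value, then the actor updates the decision rule from that assessment, repeating until the policy is self-consistent.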
The weight-shift control of the flexible wing aircraft depends on the dynamic coupling between the pilot system and the wing system: when the pilot system shifts its center of gravity relative to the wing's center of gravity, the resulting shift produces the pitch/roll control action [19]. The observable stability and control features of the flexible wing aircraft are studied using the longitudinal static stability concepts of the conventional aeroplane in [20], [21]. The lateral directional stability margins of the flexible wing aircraft are shown to be larger than those of the conventional aeroplane [19].

The paper is organized as follows: Section 2 presents the formulation of the optimal control problem. Section 3 discusses the challenges associated with implementing the classical Dual Heuristic Programming solutions. Section 4 introduces the proposed adaptive learning solution along with its implementation using the adaptive critics. Section 5 describes the challenging weight-shift control problem of the flexible wing aircraft along with the simulation results.

II. THE OPTIMAL CONTROL PROBLEM

This section presents the formulation of the optimal control problem for dynamical systems. The optimality conditions are found using the classical Bellman and Hamiltonian dynamics. This mathematical framework motivates the forthcoming reinforcement learning solution.
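For orientation, the standard discrete-time setup behind the Bellman and Hamiltonian optimality conditions can be sketched as follows; the symbols here are generic placeholders, not the paper's own notation, and the detailed formulation follows in the section itself.

```latex
% Generic discrete-time control-affine dynamics and cost (illustrative):
x_{k+1} = f(x_k) + g(x_k)\,u_k, \qquad
J = \sum_{k=0}^{\infty} U(x_k, u_k)

% Bellman (value) recursion:
V(x_k) = U(x_k, u_k) + V(x_{k+1})

% Hamiltonian and costate -- the costate \lambda_k is the quantity that
% Dual Heuristic Dynamic Programming structures approximate:
H_k = U(x_k, u_k) + \lambda_{k+1}^{\mathsf T}\bigl(f(x_k) + g(x_k)\,u_k\bigr),
\qquad
\lambda_k = \frac{\partial V(x_k)}{\partial x_k}
```

Differentiating the Bellman recursion with respect to the state yields the costate recursion, which is why approximating \(\lambda(x)\) directly (rather than \(V(x)\)) requires gradient information of the dynamics, the dependence this paper's method is designed to avoid.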