Adaptive Optimal Control for Time-Varying Discrete-Time Linear Systems

Shuzhi Sam Ge 1, Chen Wang 1 and Yanan Li 2

1 Shuzhi Sam Ge and Chen Wang are with the Social Robotics Lab, Interactive & Digital Media Institute (IDMI) and the Department of Electrical & Computer Engineering, National University of Singapore, Singapore 117576, samge@nus.edu.sg
2 Yanan Li is with the NUS Graduate School for Integrative Sciences and Engineering (NGS), National University of Singapore, Singapore 117456, liyanan84@nus.edu.sg

Abstract: In this paper, adaptive optimal control is proposed for time-varying discrete-time linear systems subject to unknown system dynamics. The method is a direct application of Q-learning adaptive dynamic programming to time-varying systems. To derive the optimal control policy, an actor-critic structure is constructed and a time-varying least squares method is adopted for parameter adaptation. It is shown that the derived control policy robustly stabilizes the time-varying system while guaranteeing optimal control performance. As no particular system information is required throughout the process, the proposed technique provides a potentially feasible solution to a large variety of control applications. The validity of the proposed method is verified through simulation studies.

Index Terms – adaptive dynamic programming; adaptive optimal control; time-varying linear discrete system

I. INTRODUCTION

All control problems involve regulating a system's input so that the system meets some desired specifications. Among all design criteria, stability is the most fundamental requirement. Controllers that robustly stabilize a particular set of plants have been studied in many research works [1], [2], [3], [4], [5]. The development of modern industry has raised a higher goal, which requires that the system be stabilized with a small amount of energy. This higher goal is generally characterized as the optimality of a system, which is explained as "the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision" [6].

In the literature of optimal control, two approaches are most widely studied, i.e., classical optimal control based on the maximum principle [7], [8] and dynamic programming [6]. In classical optimal control, a finite- or infinite-horizon cost function is defined to describe the control performance, and the control policy is generated by solving the well-known algebraic Riccati equation (ARE), where the state-space model of the system is assumed to be known. In dynamic programming [6], the optimal solution is acquired by solving a sequence of partitioned problems; the characteristic of dynamic programming lies in the multistage nature of the optimization procedure.

However, these conventional optimal control methods assume the system information to be known, which means that the control engineer must first identify the system model in order to build an optimal controller. This process is usually tedious and sensitive to system uncertainties and disturbances, making it undesirable in real applications. In addition, considering the time-varying nature of most plant models, the traditional methods are impractical in most situations because of their off-line optimization and slow response to plant parameter variations.
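As a brief illustration of this model dependence (stated here in standard LQR notation, which is an assumption rather than this paper's own notation): for a known discrete-time system $x_{k+1} = A x_k + B u_k$ with cost

$$
J = \sum_{k=0}^{\infty} \left( x_k^{T} Q x_k + u_k^{T} R u_k \right), \qquad Q \geq 0, \; R > 0,
$$

the optimal policy $u_k = -K x_k$ is obtained by solving the discrete-time ARE

$$
P = A^{T} P A - A^{T} P B \left( R + B^{T} P B \right)^{-1} B^{T} P A + Q, \qquad
K = \left( R + B^{T} P B \right)^{-1} B^{T} P A,
$$

which presupposes exact knowledge of the system matrices $A$ and $B$.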
To tackle this problem, adaptive dynamic programming (ADP), or actor-critic learning, was proposed in [9], [10]. ADP mimics the way that biological systems interact with the environment. In the scheme of ADP, the system is considered as an agent which modifies its action according to environmental stimuli. The action is strengthened (positive reinforcement) or weakened (negative reinforcement) according to the evaluation of a critic. By using ADP, an optimal control policy can be generated with partial or no information about the system. This is a heuristic process in which an agent tries to maximize its future rewards; in a control engineering context, the maximization of reward is equivalent to the minimization of a control cost.

Among these ADP approaches, the most recognized discrete-time algorithms are heuristic dynamic programming (HDP), dual heuristic programming (DHP), globalized DHP (GDHP), and action-dependent heuristic dynamic programming (ADHDP), or Q-learning [11]. The common feature of these algorithms is that designing the optimal controller requires only partial information about the system model to be controlled. Among these schemes, for discrete-time systems, ADHDP [12], or Q-learning [13], is an online iterative learning algorithm which does not rely on the specific plant model to be controlled. Owing to its unique online learning and control structure, this method has been widely applied in many research fields, such as nonzero-sum games [14], robotic arm control [15] and optimal output feedback control designs [16].

The idea of the proposed adaptive optimal control method is similar to that of [14]. However, instead of developing adaptive optimal control for a time-invariant system, we further consider the time-varying nature of the actual plant model and try to solve the following difficulties, i.e., (a) the time-varying system parameters are not computationally tractable in real practice, and (b) when the parameters of the plant