TRAJECTORY PLANNING OF A ROBOT USING LEARNING ALGORITHMS

A. Tsoularis, C. Kambhampati, K. Warwick. The University of Reading, U.K.

ABSTRACT

We consider the problem of a robot manipulator operating in a noisy workspace. The manipulator is required to move from an initial position P_I to a final position P_F. P_I is assumed to be completely defined. However, P_F is obtained by a sensing operation and is assumed to be fixed but unknown. Our approach to this problem involves the use of three learning algorithms: the Discretized Linear Reward-Penalty (DL_RP) automaton, the Linear Reward-Penalty (L_RP) automaton and a nonlinear reinforcement scheme. An automaton is placed at each joint of the robot and, by acting as a decision maker, plans the trajectory based on noisy measurements of P_F.

INTRODUCTION

The robot control problem is to design stable and robust algorithms to control the robot to follow a specified trajectory (1,2,3). In order to plan a trajectory, transformations from the Cartesian space of the end-effector to the joint space of the manipulator must be performed; that is, an inverse kinematic solution must be found. Then the forces and torques to be applied to the joints to achieve the desired trajectory must be computed; that is the problem of the inverse dynamics. The operation of a robot in a noise-free workspace is the subject of extensive research. A good survey of control algorithms can be found in (2). In this article we will briefly describe solutions which incorporate some form of learning.
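The inverse kinematic step mentioned above admits a closed-form solution in the simplest case. The following is a minimal sketch in Python, assuming a planar two-link arm with link lengths l1 and l2; this geometry is chosen purely for illustration and is not taken from the paper.

```python
import math

def inverse_kinematics_2link(x, y, l1, l2):
    """Map a Cartesian end-effector target (x, y) to joint angles
    (theta1, theta2) for a planar two-link arm (elbow-down branch)."""
    # Law of cosines gives the cosine of the elbow angle.
    c2 = (x**2 + y**2 - l1**2 - l2**2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target outside the arm's workspace")
    theta2 = math.acos(c2)  # elbow angle, elbow-down solution
    # Shoulder angle: direction to target minus the offset caused by link 2.
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2
```

Applying the forward kinematics x = l1 cos(theta1) + l2 cos(theta1 + theta2), y = l1 sin(theta1) + l2 sin(theta1 + theta2) to the returned angles recovers the commanded Cartesian target.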
Arimoto and others (4,5) proposed an iterative learning structure for the operation of a robot such that the (n+1)th input to the joint actuators is the sum of the nth input and an error increment composed of the derivative of the difference between the nth motion trajectory and the given desired motion trajectory:

u^(n+1) = u^(n) + Γ d/dt (y_d − y^(n))

where Γ is a positive-definite constant gain matrix, u^(n) and u^(n+1) are the nth and (n+1)th inputs respectively, and y_d is the desired trajectory. They showed that for a class of linear and nonlinear dynamical systems the learning process converges in the sense that y^(n) → y_d as n → ∞.

Miller and others (6) proposed a technique for the control of a robot manipulator based on the Cerebellar Model Articulation Controller (CMAC) developed by Albus. The control scheme requires no a priori knowledge of the robot dynamics, as this is acquired on-line by observing the robot input-output values and altering the values already held in the CMAC memory module.

Miyamoto and others (7) proposed a hierarchical neural network scheme for the control of a three degree of freedom industrial manipulator. The total torque T fed to the actuator is the sum of the feedback torque T_fb, which is calculated from the trajectory error θ_d − θ multiplied by the feedback gain K, and the feedforward torque T_ff, which is calculated by the inverse dynamics model. They called their scheme "feedback-error-learning" to stress the fact that the output of the lower level neural network is used as an error signal for learning of the higher level neural network.

In this article we suggest an alternative solution based on the theory of stochastic automata. A learning automaton is placed at every joint of the manipulator and, based on repeated noisy observations of P_F, denoted by P̂_F, updates its actions accordingly.

LEARNING AUTOMATA

Learning automata are systems that improve their performance in random environments.
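The iterative learning update of Arimoto and others can be sketched numerically. The code below is an illustrative example only: it applies the derivative-type update to a scalar first-order plant y' = -a*y + u, with plant, gains and desired trajectory all assumptions made for this demonstration rather than details from the paper.

```python
# D-type iterative learning control in the spirit of
# u^(n+1) = u^(n) + Gamma * d/dt (y_d - y^(n)),
# demonstrated on the illustrative scalar plant y' = -a*y + u.
import math

def ilc_demo(iterations=30, n_steps=100, dt=0.01, a=1.0, gamma=1.0):
    # Desired trajectory: one period of a sine wave (illustrative choice).
    y_d = [math.sin(2 * math.pi * k * dt) for k in range(n_steps + 1)]
    u = [0.0] * n_steps  # initial input guess: zero torque profile

    for _ in range(iterations):
        # One trial: simulate the plant under the current input profile.
        y = [y_d[0]]
        for k in range(n_steps):
            y.append(y[k] + dt * (-a * y[k] + u[k]))
        e = [y_d[k] - y[k] for k in range(n_steps + 1)]
        # Learning update: add the discretized derivative of the error.
        u = [u[k] + gamma * (e[k + 1] - e[k]) / dt for k in range(n_steps)]

    # Maximum tracking error of the last completed trial.
    return max(abs(err) for err in e)
```

Repeating the trial drives the tracking error toward zero, mirroring the convergence y^(n) → y_d as n → ∞ stated above: the error after 30 trials is orders of magnitude smaller than after the first.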
A learning automaton operating in a random environment updates its strategy for choosing actions on the basis of the environment's response. The automaton has a finite number of actions and, corresponding to each action, the response of the environment can be either favourable or unfavourable with a certain probability.

The basic structure of a single learning automaton consists of an automaton and an environment connected in a feedback configuration (see Fig. 1). The automaton performs one action α_i out of a finite set α = {α_1, α_2, ..., α_m}, where 1 ≤ i ≤ m. The environment provides a response β which is binary, β ∈ {0, 1}, where β = 0 represents success and β = 1 failure. The environment can be described by a set of penalty probabilities c_i, where 1 ≤ i ≤ m. c_i is defined by
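The feedback loop of Fig. 1 can be sketched in code. The update rules below follow the standard textbook L_RP scheme, which is an assumption here since the paper's exact reinforcement rules are defined later; the penalty probabilities, learning rates and trial count are likewise illustrative.

```python
import random

def lrp_step(p, action, beta, a=0.05, b=0.05):
    """One standard linear reward-penalty (L_RP) update of the action
    probability vector p, given that `action` drew response `beta`
    (0 = success, 1 = failure). This textbook rule is an assumption;
    the paper's own scheme is specified later in the text."""
    m = len(p)
    q = list(p)
    if beta == 0:  # success: shift probability toward the chosen action
        for j in range(m):
            q[j] = (1 - a) * p[j]
        q[action] = p[action] + a * (1 - p[action])
    else:          # failure: shift probability away from the chosen action
        for j in range(m):
            q[j] = b / (m - 1) + (1 - b) * p[j]
        q[action] = (1 - b) * p[action]
    return q

def run(penalty_probs, steps=5000, seed=0):
    """Automaton-environment feedback loop: the environment penalizes
    action i with probability c_i (stationary random environment)."""
    rng = random.Random(seed)
    m = len(penalty_probs)
    p = [1.0 / m] * m  # start with all actions equally likely
    for _ in range(steps):
        action = rng.choices(range(m), weights=p)[0]
        beta = 1 if rng.random() < penalty_probs[action] else 0
        p = lrp_step(p, action, beta)
    return p
```

Run against penalty probabilities such as c = (0.2, 0.8), the automaton shifts probability mass toward the action with the lower penalty probability while both updates keep the probability vector normalized.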