PATH PLANNING OF ROBOTS IN NOISY WORKSPACES USING LEARNING AUTOMATA

A. Tsoularis, C. Kambhampati, K. Warwick
Systems Research Group, Department of Cybernetics, University of Reading, Whiteknights, Reading RG6 2AB, UK
*Person to whom all correspondence should be addressed
email: shskambh@reading.ac.uk, cybat@cyber.reading.ac.uk

keywords: learning automaton, environment, fixed-structure automaton, variable-structure automaton, trajectory planning

Abstract

We consider the problem of a manipulator operating in a noisy workspace and required to move from a fixed initial position P_0 to a final position P_f. However, P_f is corrupted by noise, giving rise to P̂_f, which may be obtained by sensors. We propose the use of learning automata to tackle this problem. An automaton is placed at each joint of the manipulator, which moves according to the action chosen by the automaton (forward, backward, stationary) at each instant. Let D(n) be the distance of the end-effector from the target position P̂_f(n) at the nth instant. If at the (n+1)th instant D(n+1) < D(n), the automata are simultaneously rewarded; otherwise they are penalised. The simultaneous reward or penalty of the automata enables us to avoid any inverse kinematics computations that would be necessary if the distance of each joint from the final position had to be calculated. We employ three variable-structure learning algorithms: the Discretized Linear Reward-Penalty (DL_RP), the Linear Reward-Penalty (L_RP), and a nonlinear scheme. Each algorithm is separately tested with two (forward, backward) and three (forward, backward, stationary) actions.

1. Fundamentals of Learning Automata

The basic structure of a single learning automaton consists of an automaton and an environment connected in a feedback configuration [1]. The automaton performs one action a_i out of a finite set a = {a_1, a_2, ..., a_m}, where 1 <= i <= m.
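The joint-level scheme described in the abstract can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: the planar two-link arm, the link lengths, the target position, the joint step size STEP, and the update constants A and B are all assumed values, and the probability update shown is a generic linear reward-penalty rule standing in for the schemes the paper tests. What the sketch shows is the key structural idea: every joint automaton receives the same reward/penalty signal, derived only from the end-effector distance, so no per-joint inverse kinematics is needed.

```python
import math
import random

ACTIONS = (-1, 0, 1)   # backward, stationary, forward (per-joint increments)
STEP = 0.05            # joint increment in radians (assumed value)
A, B = 0.05, 0.02      # reward / penalty step sizes (assumed values)

def forward_kinematics(q, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar 2-link arm with joint angles q."""
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return (x, y)

def choose(p):
    """Sample an action index from the probability vector p."""
    r, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r <= acc:
            return i
    return len(p) - 1

def update(p, i, rewarded):
    """Generic linear reward-penalty update of one automaton's probabilities."""
    m = len(p)
    for j in range(m):
        if rewarded:
            p[j] = p[j] + A * (1.0 - p[j]) if j == i else (1.0 - A) * p[j]
        else:
            p[j] = (1.0 - B) * p[j] if j == i else B / (m - 1) + (1.0 - B) * p[j]

target = (1.2, 0.8)                       # noisy target estimate (assumed fixed here)
q = [0.0, 0.0]                            # initial joint angles
probs = [[1.0 / 3] * 3 for _ in q]        # one 3-action automaton per joint
d_prev = math.dist(forward_kinematics(q), target)

for n in range(2000):
    idx = [choose(p) for p in probs]      # each automaton picks an action
    for k, i in enumerate(idx):
        q[k] += STEP * ACTIONS[i]
    d = math.dist(forward_kinematics(q), target)
    rewarded = d < d_prev                 # one common signal for ALL automata
    for k, i in enumerate(idx):
        update(probs[k], i, rewarded)
    d_prev = d

print(round(d_prev, 3))                   # remaining end-effector distance
```

Note that the reward test compares only scalar distances of the end-effector, which is exactly what lets the joint automata learn jointly without any inverse kinematics computation.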
The environment provides a response b which is usually binary, b ∈ {0, 1}, where b = 0 represents success and b = 1 failure. The environment can be represented by a set of penalty probabilities c_i, where c_i is defined by

c_i = Pr{b(n) = 1 | a(n) = a_i},   i = 1, ..., m.

Consequently, c_i represents the probability that the application of an action a_i to the environment will result in a penalty output. A stochastic automaton is defined by the quintuple {a, b, φ, F, H} where
(i) a is a set of actions or outputs, a = {a_1, a_2, ..., a_m}.
(ii) b is a binary input set.
(iii) φ is a set of possible states for the automaton, φ = {φ_1, φ_2, ..., φ_k}.
(iv) F is the transition function, which determines the state at the next instant.
(v) H is the output function, which determines the output of the automaton, i.e. its action, at any instant.

Note that the number of states k must always be >= the number of actions m. The automaton is basically a sequential machine admitting a sequence of inputs and producing a sequence of actions. Assuming the initial state φ(0) is known, the action a(0) is defined by H(φ(0)). Subsequently φ(1) is determined by F(φ(0), b(0)), where b(0) is the response of the environment to a(0). Note that the state and action at the instant n depend only on the state and input at the previous instant n-1. F and H can be either deterministic, that is, given φ(0) the successive states and actions can be uniquely determined, or stochastic, that is, future states and actions can be predicted only with certain probabilities. F and H are usually represented in matrix form [f_ij], [h_ij]. For a deterministic automaton, the transition matrices [f_ij], [h_ij] consist of elements that are either 0 or 1, whereas for a stochastic automaton these matrices contain probabilities, and therefore the sum of the row entries in each of the matrices is unity to conserve probability measure.

0-7803-1206-6/93/$3.00 © 1993 IEEE. Proceedings of the 1993 International Symposium on Intelligent Control, Chicago, Illinois, USA, August 1993.
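The quintuple {a, b, φ, F, H} can be made concrete with a small sketch. Everything numeric here is a made-up example, not from the paper: a two-state automaton whose transition function F is given as one row-stochastic matrix per input value b ∈ {0, 1}, and whose output function H is deterministic (each state emits a fixed action). The row-sum check at the end reflects the conservation-of-probability condition stated above.

```python
import random

# Hypothetical two-state, two-action stochastic automaton (illustrative numbers).
# F[b] is the transition matrix used when the environment input is b.
F = {
    0: [[0.9, 0.1],        # b = 0 (success): tend to stay in the current state
        [0.1, 0.9]],
    1: [[0.2, 0.8],        # b = 1 (penalty): tend to switch state
        [0.8, 0.2]],
}
H = [0, 1]                 # deterministic output function: state k -> action H[k]

def step(state, b):
    """One transition: sample phi(n+1) from row F[b][phi(n)], then apply H."""
    row = F[b][state]
    r, acc = random.random(), 0.0
    new_state = len(row) - 1
    for s, p in enumerate(row):
        acc += p
        if r <= acc:
            new_state = s
            break
    return new_state, H[new_state]

# Each row of each transition matrix must sum to 1 (probability measure).
for b in (0, 1):
    for row in F[b]:
        assert abs(sum(row) - 1.0) < 1e-12

state = 0
state, action = step(state, b=1)   # penalty input: likely switches state
print(state, action)
```

A deterministic automaton is the special case where every row of F[b] contains a single 1; a variable-structure automaton would additionally update the entries of F (or of an action probability vector) at each stage n.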
If f_ij and h_ij are constants in the interval [0, 1] and do not vary with the instant n, then the automaton is called a Fixed-Structure Stochastic Automaton. If the probabilities f_ij and h_ij are updated with n, then such an automaton is called a Variable-Structure Stochastic Automaton [1]. A Learning Automaton is a stochastic automaton in a feedback arrangement with a random environment that updates its action probability vector p(n) = {p_1(n), p_2(n), ..., p_m(n)} at each stage n. The learning automaton is expected to keep improving its performance as n increases.

We proceed now to define a quantity that serves as an average measure of the performance of a learning automaton. M(n) is the average (expected) penalty at instant n given the action probability vector p(n):

M(n) = E[b(n) | p(n)] = Σ_{i=1}^{m} c_i p_i(n).

If all actions are chosen with equal probability, p_i(n) = 1/m, then the average penalty is

M_0 = (1/m) Σ_{i=1}^{m} c_i.

Such an automaton is called a "pure-chance automaton". Ideally the automaton should be choosing the action that results in minimum penalty from the environment as n → ∞. As a crude measure of performance, M(n) should be less than M_0 as n → ∞. Since M(n) and p(n) are random variables, we compare their expectations. A learning automaton is expedient if

lim_{n→∞} E[M(n)] < M_0.
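The performance measures above are easy to evaluate numerically. In this sketch the penalty probabilities c_i are made-up values chosen purely for illustration; the point is that any probability vector that shifts mass toward the low-penalty action drives the average penalty M(n) below the pure-chance level M_0, which is the comparison underlying expediency.

```python
# Assumed penalty probabilities c_i for three actions (illustrative only).
c = [0.8, 0.3, 0.5]
m = len(c)

def average_penalty(p):
    """M(n) = sum_i c_i * p_i(n), the expected penalty under p."""
    return sum(ci * pi for ci, pi in zip(c, p))

# Pure-chance automaton: every action equally likely, p_i = 1/m.
M0 = average_penalty([1.0 / m] * m)

# A vector favouring a_2, the action with the smallest penalty probability.
p = [0.1, 0.8, 0.1]

print(round(M0, 4), round(average_penalty(p), 4))   # → 0.5333 0.37
```

Here the favouring vector gives an average penalty of 0.37, below the pure-chance level M_0 ≈ 0.533, as expediency requires.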