Reinforcement Learning: Theory and Practice

Csaba Szepesvári
Associative Computing Ltd.
Budapest 1121, Konkoly Thege M. út 29-33
e-mail: szepes@mindmaker.kfkipark.hu

Abstract

We consider reinforcement learning methods for the solution of complex sequential optimization problems. In particular, the soundness of two methods proposed for the solution of partially observable problems will be shown. The first method suggests a state-estimation scheme and requires mild a priori knowledge, while the second method assumes that a significant amount of abstract knowledge is available about the decision problem and uses this knowledge to set up a macro-hierarchy that turns the partially observable problem into one which can already be handled using methods worked out for observable problems. This second method is also illustrated with some experiments on a real robot.

1 INTRODUCTION

Reinforcement learning (RL) concerns the problem of learning to control a process such that a long-term performance criterion is optimized, where the criterion is defined in terms of some observable, local (or immediate) reinforcement values. Most RL algorithms assume that the optimization problem can be transformed into a Markovian Decision Problem (MDP) and thus enjoy a firm theoretical background, which can be ascribed to the fact that Bellman's principle of optimality holds in MDPs.

In order to explain this remarkable principle, let us first define MDPs. A Markovian Decision Process is completely specified by its states, a set of possible actions for each state, a (sometimes stochastic) dynamics which describes the evolution of states under the execution of actions, the immediate rewards that we encounter when a state transition occurs, and the way immediate rewards should be glued together to yield the long-term reward (this can be as simple as adding together the immediate rewards). In mathematical terms, an MDP
can be represented by a tuple (X, A, A, p, r, Q), where X is the state space (this is usually a finite set), A is the action space (typically another finite set), A : X → 2^A (2^A denotes the power set of A) determines the set of admissible actions for each state, which must be non-empty,¹ p : X × A × X → [0, 1] is a transition probability function, p(x, a, y) being the probability that the next state is y given that action a is executed in state x, r : X × A → R is the immediate reward function, and Q is a rule that prescribes the way immediate rewards determine the long-term outcome of an action sequence.²

Bellman's principle of optimality claims that the optimal action in a state is the one whose long-term return is maximal if the actions from the next step on are always chosen optimally. An MDP can be thought of as a framework for optimal control: the controls are identified with the actions and the (stochastic) dynamics is described by the transition probabilities. When a system that can be modeled by an MDP is controlled, the state of the system can be either observable or unobservable (hidden) to the controller. In the unobservable case the controller observes only a (typically not one-to-one) function of the state and should control the system using this restricted information only. Most basic RL algorithms are developed for finite-size (and even small), observable MDPs, since these are mathematically very attractive. In contrast, most real-world applications can only be represented by models in which observability does not hold, nor is the size of the model small, and sometimes the model is not even finite. The following list shows common situations where RL has already found good applications or may find applications in the future.

¹Unless otherwise noted, for most of the developments in this paper we will assume that A(x) = A holds for all x, in which case A will not be mentioned explicitly.
²Specifically, Q : R^X × R^(X×A) → R^(X×A).
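The tuple above can be mirrored directly by a small data structure. Below is a minimal sketch in Python; the class name, the dictionary-based encoding of p and r, and the toy two-state example are all illustrative choices, not part of the paper:

```python
import random

# A minimal sketch of a finite MDP (X, A, p, r) mirroring the text's
# notation; the names and the toy two-state example are hypothetical.
class MDP:
    def __init__(self, states, actions, p, r):
        self.states = states    # X: finite state set
        self.actions = actions  # A(x): admissible actions for each state
        self.p = p              # p[(x, a, y)] = P(next state is y | x, a)
        self.r = r              # r[(x, a)] = immediate reward

    def step(self, x, a):
        """Sample the next state y with probability p(x, a, y)."""
        probs = [self.p.get((x, a, y), 0.0) for y in self.states]
        return random.choices(self.states, weights=probs)[0]

# Toy example: "stay" keeps the current state, "move" flips it.
states = ["s0", "s1"]
actions = {"s0": ["stay", "move"], "s1": ["stay", "move"]}
p = {("s0", "stay", "s0"): 1.0, ("s0", "move", "s1"): 1.0,
     ("s1", "stay", "s1"): 1.0, ("s1", "move", "s0"): 1.0}
r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "move"): 1.0}
mdp = MDP(states, actions, p, r)
```

Since all transitions in the toy example are deterministic, `mdp.step("s0", "move")` always returns `"s1"`; with a genuinely stochastic p, repeated calls sample different successors.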
For example, if the rule is to take the expected value of the discounted sum of the immediate rewards, then [Q(v, r)](x, a) = r(x, a) + γ Σ_{y∈X} p(x, a, y) v(y); here the value v(y) of the generic function v represents the long-term reward when the decision process is started in y. Usually, Q is not listed in the definition of MDPs, as it is fixed in advance.
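The rule above can be applied as written: given a state-value function v, one application of Q yields a state-action value for every pair (x, a). A minimal self-contained sketch in Python (the two-state MDP and all names here are hypothetical, chosen only to match the notation):

```python
# Sketch of the discounted-sum rule:
#   [Q(v, r)](x, a) = r(x, a) + gamma * sum_y p(x, a, y) * v(y).
# The two-state MDP below is illustrative only.
states = ["s0", "s1"]
actions = ["stay", "move"]
p = {("s0", "stay", "s0"): 1.0, ("s0", "move", "s1"): 1.0,
     ("s1", "stay", "s1"): 1.0, ("s1", "move", "s0"): 1.0}
r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "move"): 1.0}
gamma = 0.9  # discount factor

def Q(v):
    """One application of the rule: maps a state-value function v
    to the state-action value function [Q(v, r)]."""
    return {(x, a): r[(x, a)]
            + gamma * sum(p.get((x, a, y), 0.0) * v[y] for y in states)
            for x in states for a in actions}

v = {x: 0.0 for x in states}  # generic long-term value function
q = Q(v)
# With v = 0 everywhere, [Q(v, r)](x, a) reduces to the immediate
# reward r(x, a).
```

Iterating this operator (taking a max over actions between applications) is exactly the value-iteration scheme suggested by Bellman's principle of optimality.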