Reinforcement Learning: Theory and Practice

Csaba Szepesvari
Associative Computing Ltd.
Budapest 1121, Konkoly Thege M. út 29-33
e-mail: szepes@mindmaker.kfkipark.hu

Abstract

We consider reinforcement learning methods for the solution of complex sequential optimization problems. In particular, the soundness of two methods proposed for the solution of partially observable problems will be shown. The first method suggests a state-estimation scheme and requires mild a priori knowledge, while the second method assumes that a significant amount of abstract knowledge is available about the decision problem and uses this knowledge to set up a macro-hierarchy to turn the partially observable problem into another one which can already be handled using methods worked out for observable problems. This second method is also illustrated with some experiments on a real robot.

1 INTRODUCTION

Reinforcement learning (RL) concerns the problem of learning to control a process such that a long-term performance criterion is optimized which is defined in terms of some observable, local (or immediate) reinforcement values. Most RL algorithms assume that the optimization problem can be transformed into a Markovian Decision Problem (MDP) and thus enjoy a firm theoretical background, which can be ascribed to the fact that Bellman's principle of optimality holds in MDPs. In order to explain this remarkable principle, let us first define MDPs. A Markovian Decision Process is completely specified by its states, a set of possible actions for each state, a (sometimes stochastic) dynamics which describes the evolution of states under the execution of actions, the immediate rewards that we encounter when a state transition occurs, and the way immediate rewards should be glued together to yield the long-term reward (this can be as simple as adding together the immediate rewards). In
mathematicalterms, an MD ! cn be represented by a tuple (X, A, A,p, r, Q), where X is the state-spce (this is usully a initeset) A istheaction-space(typicallyanotheriniteset) A : X + 2A (2A denotesthepowersetof A) determines thesetofadmissibleactionsforeachstatewhichmustbenon-empty p : X x A x X + [0, 11 is atransitionprobabilityfunction, p(x, ., y ) beingtheprobability thatthenextstateis y giventhatction . isexecuted instate xi r : �y x A + t is theimmediatereardfunction and Q isarulethatprescribes thewayimmediaterewardsdetermine thelong-termoutcome ofanactionsequence. Bellman'sprinciple of optimality claims that the optimal action in a state is the one whose long-term return is mimal if the actionsfromthenextstepare alwayschosen optimally. AMDPcalbethoughtofas a framework foroptimalcontrol: thecontrols reidentiiedbytheactionsandthe (stochastic) dynamicsisdescribed bythe transition probailities. hen a system thatcanbemodeledby an IDP is controlledthestate of the system can beeither observable orunobservable (hidden) to the controller. In the unobservable casethecontrollerobserves onlya(typicallynotone-to-one)functionof thestateandshouldcontrol the systemusing thisrestricted informationonly. MostbasicRLalgorithmsaredevelopedforinite-size(and evensmall), observableVlDPssincethesearemathematicallyveryattractive. In contrast, most rel-world applications can only be represented by models in which observability doesnothold,noristhesizeofthemodelsmallandsometimesthemodelisnoteeninite. Thefollowing list showscommonsituations where Rhasalread�yfound good applications or may foundapplications inthe future lUnlcss o.herwise noted, for most or .he developmen.s in .his paper we will assume 
that A(x) = A holds for all x, in which case A will not be mentioned explicitly.

² Specifically, Q : ℝ^X × ℝ^(X×A) → ℝ^(X×A). For example, if the rule is to take the expected value of the (discounted) sum of the immediate rewards, then [Q(v, r)](x, a) = r(x, a) + γ Σ_{y∈X} p(x, a, y) v(y); here the value v(y) of the generic function v represents the long-term reward when the decision process is started in y. Usually, Q is not listed in the definition of MDPs as it might be fixed otherwise.
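The expected-discounted-sum rule in footnote 2 corresponds to a one-step Bellman backup on a finite MDP. The following Python sketch shows how the tuple components p and r can be stored as arrays and how the operator Q acts on a value function; the array names, the discount factor γ = 0.9, and the tiny two-state example are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def Q_operator(v, R, P, gamma=0.9):
    """One-step backup [Q(v, r)](x, a) = r(x, a) + gamma * sum_y p(x, a, y) v(y).

    v: value function over states, shape (n_states,)
    R: immediate rewards r(x, a), shape (n_states, n_actions)
    P: transition probabilities p(x, a, y), shape (n_states, n_actions, n_states)
    """
    return R + gamma * (P @ v)  # P @ v contracts over the next-state index y

# Tiny two-state, two-action MDP (all numbers illustrative).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # transitions from state 0
              [[0.5, 0.5], [1.0, 0.0]]])  # transitions from state 1
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Each row P[x, a, :] must be a probability distribution over next states y.
assert np.allclose(P.sum(axis=2), 1.0)

# Starting from v = 0, the backup reduces to the immediate rewards.
q = Q_operator(np.zeros(2), R, P)
```

Iterating v(x) ← max_a [Q(v, r)](x, a) is the standard value-iteration scheme, which for discounted finite MDPs converges to the optimal value function whose greedy actions satisfy Bellman's principle of optimality.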