Variable Metric Reinforcement Learning Methods Applied to the Noisy Mountain Car Problem

Verena Heidrich-Meisner and Christian Igel
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
{Verena.Heidrich-Meisner,Christian.Igel}@neuroinformatik.rub.de

Abstract. Two variable metric reinforcement learning methods, the natural actor-critic algorithm and the covariance matrix adaptation evolution strategy, are compared on a conceptual level and analysed experimentally on the mountain car benchmark task with and without noise.

1 Introduction

Reinforcement learning (RL) algorithms address problems where an agent is to learn a behavioural policy based on reward signals, which may be unspecific, sparse, delayed, and noisy. Many different approaches to RL exist; here we consider policy gradient methods (PGMs) and evolution strategies (ESs). This paper extends our previous work on analysing the conceptual similarities and differences between PGMs and ESs [1]. For the time being, we look at a single representative of each approach that has been very successful in its respective area: the natural actor-critic algorithm (NAC, [2–5]) and the covariance matrix adaptation ES (CMA-ES, [6]). Both are variable metric methods, actively learning about the structure of the search space. The CMA-ES is regarded as state of the art in real-valued evolutionary optimisation [7]. It has been successfully applied and compared to other methods in the domain of RL [8–12]. Interestingly, recent studies compare CMA-ES and variants of the NAC algorithm in the context of optimisation [13], while we look at both methods in RL. We promote the CMA-ES for RL because of its efficiency and, even more importantly, its robustness.
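This robustness can be illustrated with a minimal sketch: a simple (mu, lambda) evolution strategy that, like the CMA-ES, selects candidate solutions based only on the ranking of their noisy fitness values. The quadratic objective and all parameter settings below are illustrative assumptions, and the sketch deliberately omits the covariance matrix adaptation of the full CMA-ES; it only demonstrates that rank-based selection makes progress despite noisy evaluations.

```python
import numpy as np

def noisy_sphere(x, rng, noise=0.25):
    # Stand-in for a noisy RL return: true objective plus evaluation noise.
    return float(np.sum(x ** 2) + noise * rng.normal())

def rank_based_es(dim=5, mu=5, lam=20, sigma=0.5, steps=200, seed=0):
    """Minimal (mu, lambda) evolution strategy with rank-based selection.

    Illustrative sketch only: isotropic mutations and a fixed step-size
    decay stand in for the covariance matrix adaptation of the CMA-ES.
    """
    rng = np.random.default_rng(seed)
    mean = np.ones(dim)  # initial search point (e.g. policy parameters)
    for _ in range(steps):
        offspring = mean + sigma * rng.standard_normal((lam, dim))
        # Selection uses only the *ranking* of the noisy fitness values,
        # never their absolute magnitudes or a gradient estimate.
        fitness = [noisy_sphere(x, rng) for x in offspring]
        parents = offspring[np.argsort(fitness)[:mu]]
        mean = parents.mean(axis=0)  # intermediate recombination
        sigma *= 0.98                # crude deterministic step-size decay
    return mean

final = rank_based_es()
print(np.linalg.norm(final))  # well below the initial norm sqrt(5)
```

A gradient estimate built from the same noisy evaluations would be perturbed directly by the noise magnitude, whereas the ranking of the offspring is unchanged as long as the noise does not swap their relative order.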
The superior robustness compared to other RL algorithms has several reasons, but probably the most important one is that the adaptation of the policy as well as of the metric is based on ranking policies, which is much less error-prone than estimating absolute performance values or performance gradients. Our previous comparison of NAC and CMA-ES on different variants of the single-pole balancing benchmark in [1] indicates that the CMA-ES is more robust w.r.t. the choice of hyperparameters (such as initial learning rates) and initial policies than the NAC. In [1] the NAC performed on par with the CMA-ES in terms of learning speed only when fine-tuning policies, but worse for