Evolutionary Learning Outperforms Reinforcement Learning on Non-Markovian Tasks

G. de Croon, M.F. van Dartel, and E.O. Postma
IKAT, Universiteit Maastricht, P.O. Box 616, 6200 MD, Maastricht, The Netherlands
{g.decroon, mf.vandartel, postma}@cs.unimaas.nl, http://www.cs.unimaas.nl

Abstract. Artificial agents are often trained to perform non-Markovian tasks, i.e., tasks in which the sensory inputs can be ambiguous. Agents typically learn how to perform such tasks using either reinforcement learning (RL) or evolutionary learning (EL). In this paper, we empirically demonstrate that these learning methods result in different levels of performance when applied to a non-Markovian task: the Active Categorical Perception (ACP) task. In the ACP task, the proportion of ambiguous sensor states can be varied. EL outperforms RL for all tested proportions of ambiguous states. In addition, we show that the relative performance difference between RL and EL increases with the proportion of ambiguous sensor states. We argue that the cause of this increasing performance difference is that in RL the learned policy consists of those state-action pairs that individually have the highest estimated values, whereas the performance of a policy for a non-Markovian task depends strongly on the combination of state-action pairs selected.

1 Introduction

Artificial agents often have to perform tasks in which the sensory inputs can be ambiguous, referred to as non-Markovian tasks [8, 1]. Agents typically learn how to perform such tasks with either reinforcement learning (RL) or evolutionary learning (EL). RL and EL differ fundamentally in the manner in which they search the policy space, i.e., the space of possible mappings from states to actions. RL searches for the optimal policy by learning a value function of states or state-action pairs.
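As a toy illustration of this value-function search, consider tabular Q-learning on a small fully observable corridor. The environment, function names, and parameter values below are illustrative assumptions, not the paper's experimental setup:

```python
import random
from collections import defaultdict

class Corridor:
    """Toy fully observable (Markovian) corridor: states 0, 1, 2.

    Reaching state 2 yields reward 1 and ends the episode.
    This is an illustrative environment, not the paper's ACP task.
    """
    actions = (0, 1)  # 0 = step left, 1 = step right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(2, self.state + (1 if action == 1 else -1)))
        done = (self.state == 2)
        return self.state, (1.0 if done else 0.0), done

def q_learning_episode(q, env, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100):
    """One episode of tabular Q-learning: the value of each individual
    (state, action) pair is estimated from sampled rewards."""
    state = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy action selection on the current value estimates
        if random.random() < epsilon:
            action = random.choice(env.actions)
        else:
            action = max(env.actions, key=lambda a: q[(state, a)])
        nxt, reward, done = env.step(action)
        # temporal-difference update of this single state-action value
        best_next = max(q[(nxt, a)] for a in env.actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = nxt
        if done:
            break

random.seed(0)
q = defaultdict(float)
env = Corridor()
for _ in range(500):
    q_learning_episode(q, env)
# After training, moving right from state 1 has the highest estimated value.
```

Because every (state, action) value is estimated individually, this scheme relies on each sensor state having a well-defined value; that is precisely the assumption that ambiguous states violate.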
Learning a value function is more difficult in a non-Markovian task than in a Markovian task, since it is hard to estimate the value of an ambiguous sensor state [1, 9]. In contrast to RL, EL searches the policy space directly, by selecting and evaluating complete policies. Therefore, EL's performance does not depend on how well the values of ambiguous states can be estimated, but on the properties of the policy space in which it searches. This difference between value-function search and direct policy search leads to the conjecture that EL performs better on non-Markovian tasks than RL does [1, 12, 5]. If the estimation of the value of ambiguous states is indeed the problem for RL, then EL should cope better with a larger proportion of ambiguous states.
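In the same toy spirit, direct policy search can be sketched as a (1+1) evolutionary hill-climber that mutates and evaluates complete observation-to-action mappings on a corridor whose states 0 and 2 produce the same observation. Again, all names and settings are illustrative assumptions rather than the paper's method:

```python
import random

def evaluate(policy, max_steps=10):
    """Run one episode with a fixed reactive policy; return total reward.

    States 0 and 2 yield the same observation (obs = state % 2), so the
    task is non-Markovian for a purely reactive agent.
    """
    state = 0
    reward = 0.0
    for _ in range(max_steps):
        obs = state % 2                      # ambiguous observation
        action = policy[obs]                 # 0 = step left, 1 = step right
        state = max(0, min(3, state + (1 if action == 1 else -1)))
        if state == 3:                       # goal reached
            reward = 1.0
            break
    return reward

def one_plus_one_ea(generations=50, seed=0):
    """(1+1) evolutionary hill-climber over complete policies.

    The unit of selection is the whole observation->action mapping,
    never an individual state-action pair.
    """
    rng = random.Random(seed)
    parent = {obs: rng.choice((0, 1)) for obs in (0, 1)}
    parent_fit = evaluate(parent)
    for _ in range(generations):
        child = dict(parent)
        obs = rng.choice((0, 1))
        child[obs] = 1 - child[obs]          # mutate one policy entry
        if evaluate(child) >= parent_fit:    # select on whole-policy fitness
            parent, parent_fit = child, evaluate(child)
    return parent, parent_fit

best_policy, best_fitness = one_plus_one_ea()
```

No value is ever assigned to an ambiguous observation in isolation; only the fitness of the complete policy matters, which is why ambiguity does not pose the same estimation problem for EL.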