Learning to Learn without Gradient Descent by Gradient Descent

Yutian Chen, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, Nando de Freitas

DeepMind, London, United Kingdom. Correspondence to: Yutian Chen <yutianc@google.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We train recurrent neural network optimizers on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer: they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks, and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.

1. Introduction

Findings in developmental psychology have revealed that infants are endowed with a small number of separable systems of core knowledge for reasoning about objects, actions, number, space, and possibly social interactions (Spelke and Kinzler, 2007). These systems enable infants to learn many skills and acquire knowledge rapidly. The most coherent explanation of this phenomenon is that the learning (or optimization) process of evolution has led to the emergence of components that enable fast and varied forms of learning. In psychology, learning to learn has a long history (Ward, 1937; Harlow, 1949; Kehoe, 1988).

Inspired by this, many researchers have attempted to build agents capable of learning to learn (Schmidhuber, 1987; Naik and Mammone, 1992; Thrun and Pratt, 1998; Hochreiter et al., 2001; Santoro et al., 2016; Duan et al., 2016; Wang et al., 2016; Ravi and Larochelle, 2017; Li and Malik, 2017). The scope of research under the umbrella of learning to learn is very broad. The learner can implement, and be trained by, many different algorithms, including gradient descent, evolutionary strategies, simulated annealing, and reinforcement learning.

For instance, one can learn to learn by gradient descent by gradient descent, or learn local Hebbian updates by gradient descent (Andrychowicz et al., 2016; Bengio et al., 1992). In the former, one uses supervised learning at the meta-level to learn an algorithm for supervised learning, while in the latter, one uses supervised learning at the meta-level to learn an algorithm for unsupervised learning.

Learning to learn can be used to learn both models and algorithms. In Andrychowicz et al. (2016) the output of meta-learning is a trained recurrent neural network (RNN), which is subsequently used as an optimization algorithm to fit other models to data. In contrast, in Zoph and Le (2017) the output of meta-learning can also be an RNN, but this new RNN is subsequently used as a model that is fit to data using a classical optimizer. In both cases the output of meta-learning is an RNN, but this RNN is interpreted and applied either as a model or as an algorithm. In this sense, learning to learn with neural networks blurs the classical distinction between models and algorithms.

In this work, the goal of meta-learning is to produce an algorithm for global black-box optimization. Specifically, we address the problem of finding a global minimizer of an unknown (black-box) loss function f. That is, we wish to compute x* = argmin_{x ∈ X} f(x), where X is some search space of interest. The black-box function f is not available to the learner in simple closed form at test time, but can be evaluated at a query point x in the domain. This evaluation produces either deterministic or stochastic outputs y ∈ R such that f(x) = E[y | f(x)].
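As a concrete illustration of this problem setting (not taken from the paper; the objective, noise model, and the random-search baseline below are all hypothetical), an optimizer only interacts with f through noisy point-wise queries whose expectation equals f(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """A hypothetical black-box objective, unknown to the optimizer."""
    return float(np.sum((x - 0.3) ** 2))

def query(x, noise_std=0.1):
    """Stochastic evaluation: returns y with E[y | x] = f(x)."""
    return f(x) + rng.normal(0.0, noise_std)

# A naive random-search baseline: the only access to f is via noisy queries.
candidates = rng.uniform(-1.0, 1.0, size=(100, 2))
ys = np.array([query(x) for x in candidates])
best = candidates[np.argmin(ys)]
```

A learned RNN optimizer, like Bayesian optimization, replaces the random proposal rule with a policy that chooses each query point based on the history of past queries and their noisy observations.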
In other words, we can only observe the function f through unbiased noisy point-wise observations y.

Bayesian optimization is one of the most popular black-box optimization methods (Brochu et al., 2009; Snoek et al., 2012; Shahriari et al., 2016). It is a sequential model-based decision-making approach with two components. The first component is a probabilistic model, consisting of a prior distribution that captures our beliefs about the behavior of the unknown objective function and an observation model that describes the data generation mechanism.
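The most common choice of probabilistic model in this setting is a Gaussian process (GP). As a minimal sketch (illustrative only; the kernel, length-scale, and noise variance below are hypothetical choices, not the paper's), the GP posterior combines a prior over smooth functions with a Gaussian observation model to yield a predictive mean and uncertainty at unobserved points:

```python
import numpy as np

def rbf(X1, X2, ls=0.5):
    """Squared-exponential kernel: a common prior over smooth functions."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X_obs, y_obs, X_query, noise_var=0.01):
    """Posterior mean/variance of a zero-mean GP under Gaussian noise."""
    K = rbf(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    Ks = rbf(X_obs, X_query)
    Kss = rbf(X_query, X_query)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

# Predictive uncertainty shrinks near observations and grows far from them.
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = np.sin(3 * X_obs[:, 0])
mean, var = gp_posterior(X_obs, y_obs, np.array([[0.25], [2.0]]))
```

This uncertainty estimate is what the second component of Bayesian optimization, the acquisition policy, exploits to balance exploration against exploitation.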