Sample-Efficient Co-Design of Robotic Agents Using Multi-Fidelity Training on Universal Policy Network

A PREPRINT

Kishan R. Nagiredla, Buddhika L. Semage, Thommen G. Karimpanal, Arun Kumar A. V, Santu Rana
Applied Artificial Intelligence Institute (A²I²), Deakin University, Australia
knagiredla@deakin.edu.au

ABSTRACT

Co-design involves simultaneously optimizing the controller and the agent's physical design. Its inherent bi-level optimization formulation necessitates an outer-loop design optimization driven by an inner-loop control optimization. This can be challenging when the design space is large and each design evaluation involves a data-intensive reinforcement learning process for control optimization. To improve sample efficiency, we propose a multi-fidelity design exploration strategy based on Hyperband, in which the controllers learnt across the design space are tied together through a universal policy learner that warm-starts each subsequent controller learning problem. Further, we recommend a particular way of traversing the Hyperband-generated design matrix that ensures the stochasticity of Hyperband is reduced the most as the warm-starting effect of the universal policy learner strengthens with each new design evaluation. Experiments performed on a wide range of agent design problems demonstrate the superiority of our method compared to the baselines. Additionally, analysis of the optimized designs shows interesting design alterations, including design simplifications and non-intuitive alterations that have also emerged in the biological world.
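To make the multi-fidelity component concrete, the following is a minimal sketch of the standard Hyperband procedure (brackets of successive halving) applied to a design space. It is not the paper's implementation: `sample_design` and `evaluate(design, budget)` are hypothetical placeholders standing in for the paper's design sampler and the RL controller training that would score a design under a given training budget; the warm-starting universal policy learner is omitted here.

```python
import math
import random

def hyperband(sample_design, evaluate, R=27, eta=3):
    """Hyperband over a design space.

    sample_design() -> a candidate design (placeholder).
    evaluate(design, budget) -> score after training the controller
    for `budget` units of resource (placeholder for RL training).
    R is the maximum per-design budget; eta the halving rate.
    Returns the best (design, score) pair seen.
    """
    s_max = int(math.log(R, eta))
    best_design, best_score = None, -math.inf
    # Brackets range from aggressive (many designs, small budgets)
    # to conservative (few designs, full budget).
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial designs
        r = R * eta ** (-s)                              # initial budget
        designs = [sample_design() for _ in range(n)]
        # Successive halving within the bracket: evaluate, keep the
        # top 1/eta fraction, increase the budget, repeat.
        for i in range(s + 1):
            n_i = math.floor(n * eta ** (-i))
            r_i = r * eta ** i
            scores = [evaluate(d, int(r_i)) for d in designs]
            ranked = sorted(zip(scores, designs), key=lambda t: -t[0])
            if ranked and ranked[0][0] > best_score:
                best_design, best_score = ranked[0][1], ranked[0][0]
            designs = [d for _, d in ranked[: max(1, n_i // eta)]]
    return best_design, best_score
```

In the paper's setting, `evaluate` would be the expensive step: each call trains a controller, so sharing knowledge across calls via a universal policy network is what makes the many low-budget evaluations in the early brackets affordable.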
1 Introduction

Reinforcement Learning (RL) has been a prominent approach for training agents to learn complex behaviors, relying solely on reward maximization. This approach has shown remarkable success, as evident from humanoid robots learning to walk using RL [1]. While most robotics research is centered around a few well-known, fixed skeleton designs, e.g., robotic arms or bipedal humanoids, nature offers an abundance of exotic skeleton designs that equip animals with unique and powerful capabilities. For example, the split-hoof design of the Alpine ibex makes it an excellent climber, and exceptionally strong hind legs make kangaroo rats outstanding jumpers. Unfortunately, design optimization is a hard problem because the design space can be large and evaluating designs can be exceptionally costly, especially when the control is learned through inherently sample-intensive RL algorithms.

A subset of the robot design problem that we consider in this work deals with fixed skeletal structures but with variable parameters, for example, a robot with telescopic limbs (Fig. 1) that can perform intricate tasks. [2] showed that a bipedal robot with asymmetrical legs offers increased stability while walking up stairs. Such problems are often

arXiv:2309.04085v1 [cs.RO] 8 Sep 2023