Learning Ankle-Tilt and Foot-Placement Control for Flat-footed Bipedal Balancing and Walking

Bernhard Hengst
Computer Science and Engineering, University of New South Wales, Sydney, Australia
Email: bernhardh@cse.unsw.edu.au

Manuel Lange
Eberhard Karls University of Tübingen, Wilhelm-Schickard-Institut für Informatik
Email: manuel.lange@student.uni-tuebingen.de
and Computer Science and Engineering, University of New South Wales, Sydney, Australia

Brock White
Computer Science and Engineering, University of New South Wales, Sydney, Australia
Email: brockw@cse.unsw.edu.au

Abstract—We learn a controller for a flat-footed bipedal robot to optimally respond to both (1) external disturbances caused by, for example, stepping on objects or being pushed, and (2) rapid acceleration, such as reversal of demanded walk direction. The reinforcement learning method employed learns an optimal policy by actuating the ankle joints to assert pressure at different points along the support foot, and to determine the next swing foot placement. The controller is learnt in simulation using an inverted pendulum model and the control policy transferred and tested on two small physical humanoid robots.

I. INTRODUCTION

Bipedal locomotion is often subjected to large impact forces induced by a robot inadvertently stepping on objects or by being pushed. In robotic soccer, for example, it is not uncommon for robots to step on each other's feet or to be jostled by opposition players. At current RoboCup [1] competitions robots regularly fall over for these reasons in both humanoid and standard platform league matches. Another requirement in soccer environments is that bipedal robots should be able to react optimally to rapidly changing directional goals. In soccer it is often necessary to stop suddenly after walking at maximum speed or to reverse direction as quickly as possible.
Reinforcement learning (RL) is a machine learning technique that can learn optimal control actions given a goal specified in terms of future rewards. RL can be effective when the system dynamics are unknown, highly non-linear, or complex.

The literature on bipedal walking is extensive, with several approaches using RL. One approach uses neural-network-like function approximation to learn to walk slowly [2]; learning takes 3 to 5 hours on a simulator. Another approach concerns itself with frontal-plane control using an actuated passive walker [3]; velocity in the sagittal plane is simply controlled by the lean via the passive mechanism. Other RL approaches are limited to point feet, which only have a control effect via foot placement [4], [5], [6].

We are interested in learning a dynamically stable gait for a flat-footed planar biped. This paper describes the application of RL to control the ankle tilt to balance and accelerate the robot appropriately. When the centre of pressure is within the foot's support polygon, ankle control actions applied to the support foot can be effective throughout the whole walk cycle. We can also use RL to learn the placement of the swing foot. The result is an optimal policy that arbitrates between foot-pressure and foot-placement actions to pursue a changing walk-speed goal in the face of disturbances. This is achieved by simultaneously actuating the ankle joint of the support foot while positioning the swing foot. The formulation explicitly uses time during the walk cycle as part of the state description of the system.

Reinforcement learning relies on many trials, which makes learning directly on real robots expensive. Instead, we learn the controller using a simulated inverted pendulum model that has been parameterised to closely correspond to the physical robot. The policy is then transferred to the real robot without modification.
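To make the setup concrete, the following is a minimal sketch of this kind of learner: a linearised inverted-pendulum simulation whose pivot (the centre of pressure) can be shifted to the heel, centre, or toe of the support foot, trained with tabular Q-learning to keep the centre of mass balanced. All numeric values (CoM height, pivot offsets, time step, state bins, hyper-parameters) and the reward shape are illustrative assumptions, not the values used in the paper, and the state here omits the swing-foot position and walk-cycle time that the full formulation includes.

```python
import random
from collections import defaultdict

# Illustrative constants (assumed, not the paper's robot parameters).
G, H, DT = 9.81, 0.2, 0.01                 # gravity, CoM height, time step
PIVOTS = {"heel": -0.04, "centre": 0.0, "toe": 0.04}  # pivot offsets (m)
ACTIONS = list(PIVOTS)
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1     # assumed learning hyper-parameters

def pendulum_step(x, xdot, action):
    """Linearised inverted-pendulum dynamics about the chosen pivot."""
    xddot = (G / H) * (x - PIVOTS[action])
    xdot += xddot * DT
    return x + xdot * DT, xdot

def discretise(x, xdot):
    """Coarse state bins; the paper's state also includes the swing-foot
    position and walk-cycle time, omitted here for brevity."""
    return (min(max(int(x // 0.01), -10), 10),
            min(max(int(xdot // 0.05), -10), 10))

Q = defaultdict(float)                     # tabular action-value function

def choose(s):
    """Epsilon-greedy action selection over the three pivot actions."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def episode(steps=200):
    x, xdot = random.uniform(-0.02, 0.02), 0.0
    for _ in range(steps):
        s = choose_state = discretise(x, xdot)
        a = choose(s)
        x, xdot = pendulum_step(x, xdot, a)
        r = -abs(x) - 0.1 * abs(xdot)      # reward balancing near the centre
        s2 = discretise(x, xdot)
        best = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
        if abs(x) > 0.05:                  # CoM left the support polygon
            break

for _ in range(500):
    episode()
```

Note the physical intuition the learner exploits: shifting the pivot toward the toe places the centre of pressure ahead of the centre of mass and so decelerates forward motion, while a heel pivot accelerates it, exactly the ankle-tilt effect the paper's controller uses within the support polygon.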
The RL approach adopted here leaves open the ability to continue learning on the physical robot, using the accumulated experience from the simulator as a starting point.

In the rest of this paper we first describe our simulated system. We then provide a brief background on reinforcement learning and outline our approach to learning on the simulated biped. The behaviour under both sudden changes in policy and impulse forces in simulation is described. We also show how the policy is implemented on two physical robots by addressing practical aspects of system state estimation and policy implementation. Finally, we discuss results, related work, and future work.

II. SIMULATION

We model the flat-footed humanoid as an inverted pendulum with the pivot located at the centre of pressure along the bottom of the support foot, as shown in Figure 1. We control the pivot position by actuating the ankle joint. For simulation purposes we discretise the pivot to be in one of three positions: at the toe, centre, or heel of the foot. The state s of the system is defined by four variables (x, ẋ, w, t), where x is the horizontal displacement from the centre of the support foot to the centre of mass, ẋ is the horizontal velocity of the centre of mass, w is the horizontal