Published as a conference paper at ICLR 2022

VISION-BASED MANIPULATORS NEED TO ALSO SEE FROM THEIR HANDS

Kyle Hsu*, Moo Jin Kim*, Rafael Rafailov, Jiajun Wu, Chelsea Finn
Stanford University
{kylehsu,moojink,rafailov,jiajunwu,cbfinn}@cs.stanford.edu

ABSTRACT

We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations. Compared with the more commonly used global third-person perspective, a hand-centric (eye-in-hand) perspective affords reduced observability, but we find that it consistently improves training efficiency and out-of-distribution generalization. These benefits hold across a variety of learning algorithms, experimental settings, and distribution shifts, and for both simulated and real robot apparatuses. However, this is only the case when hand-centric observability is sufficient; otherwise, including a third-person perspective is necessary for learning, but also harms out-of-distribution generalization. To mitigate this, we propose to regularize the third-person information stream via a variational information bottleneck. On six representative manipulation tasks with varying hand-centric observability adapted from the Meta-World benchmark, this results in a state-of-the-art reinforcement learning agent operating from both perspectives improving its out-of-distribution generalization on every task. While some practitioners have long put cameras in the hands of robots, our work systematically analyzes the benefits of doing so and provides simple and broadly applicable insights for improving end-to-end learned vision-based robotic manipulation.¹

Figure 1: Illustration suggesting the role that visual perspective can play in facilitating the acquisition of symmetries with respect to certain transformations on the world state s. T0: planar translation of the end-effector and cube.
T1: vertical translation of the table surface, end-effector, and cube. T2: addition of distractor objects. O3: third-person perspective. Oh: hand-centric perspective.

* Co-first authorship. Order determined by coin flip.
¹ Project website: https://sites.google.com/view/seeing-from-hands.
arXiv:2203.12677v1 [cs.RO] 15 Mar 2022

1 INTRODUCTION

Physical manipulation is so fundamental a skill for natural agents that it has been described as a “Rosetta Stone for cognition” (Ritter & Haschke, 2015). How can we endow machines with similar