Scene Graph for Active Exploration in Cluttered Scenario

Yuhong Deng @mails.tsinghua.edu.cn Tsinghua University Haidian, Beijing, China
Qie Sima smq20@mails.tsinghua.edu.cn Tsinghua University Haidian, Beijing, China
Huaping Liu 2 hpliu@mail.tsinghua.edu.cn Tsinghua University Haidian, Beijing, China

ABSTRACT
Robotic question answering is a representative human-robot interaction task, where the robot must respond to human questions. Among these robotic question answering tasks, Manipulation Question Answering (MQA) requires an embodied robot to have both active exploration ability and semantic understanding ability over vision and language. MQA tasks typically face two main challenges: semantic understanding of cluttered scenes and manipulation planning based on the semantics hidden in vision and language. To address these challenges, we first introduce a dynamic scene graph to represent the spatial relationships between objects in cluttered scenarios. We then propose a GRU-based structure to tackle the sequence-to-sequence task of manipulation planning. At each timestep, the scene graph is updated after the robot actively explores the scenario. After thoroughly exploring the scenario, the robot can output a correct answer by searching the final scene graph. Extensive experiments have been conducted on tasks with different interaction requirements to demonstrate that our proposed framework is effective for MQA tasks. Experimental results also show that our dynamic scene graph represents semantics in clutter effectively and that the GRU-based structure performs well in manipulation planning.

CCS CONCEPTS
· Computer systems organization → Embedded systems; Redundancy; Robotics; · Networks → Network reliability.

KEYWORDS
Embodied task, Active exploration, Scene graph

ACM Reference Format:
Yuhong Deng, Qie Sima, and Huaping Liu. 2022. Scene Graph for Active Exploration in Cluttered Scenario. 
In Woodstock ’18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 9 pages. https://doi.org/XXXXXXX.XXXXXXX

Figure 1: After receiving the question, the robot understands the semantic information hidden in clutter via the initial scene graph. A manipulation sequence is used to actively explore the clutter and obtain the final scene graph, which contains sufficient information for answering the given question. The robot can then output the answer by searching the final graph.

1 INTRODUCTION
People have long anticipated that one day embodied robots can directly receive human questions posed in natural language and actively interact with the real environment to respond and give an answer [16], which reflects intelligent interactions between robot, human, and environment [17]. Recently, active manipulation has been widely used in embodied robot tasks to enable the agent to retrieve more information from the environment. Manipulation question answering (MQA) is a newly proposed human-robot interaction task in which the robot must perform manipulation actions to actively explore the environment to answer a given question [7]. However, previously proposed methods for the MQA task focus only on one specific task related to counting questions [14, 22]. Considering that the form of human-computer interaction should be varied, we extend the

Both authors contributed equally to this research.
2 Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 
Request permissions from permissions@acm.org.

Conference acronym ’XX, June 03–05, 2018, Woodstock, NY
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06 . . . $15.00
https://doi.org/XXXXXXX.XXXXXXX

arXiv:2207.07870v2 [cs.RO] 22 Feb 2023
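The pipeline described above (initial scene graph → manipulation sequence → final scene graph → answer by graph search) can be sketched with a minimal data structure: objects are nodes, spatial relations are labeled edges, the graph is updated after each manipulation action, and the answer is produced by querying the final graph. All class and method names below are illustrative assumptions for exposition, not the authors' actual implementation.

```python
from collections import defaultdict

class DynamicSceneGraph:
    """Minimal sketch: objects as nodes, spatial relations as labeled edges."""

    def __init__(self):
        self.nodes = set()             # object identifiers
        self.edges = defaultdict(set)  # (subject, relation) -> {objects}

    def add_relation(self, subj, relation, obj):
        """Record an observed spatial relation, e.g. ("cup", "behind", "box")."""
        self.nodes.update((subj, obj))
        self.edges[(subj, relation)].add(obj)

    def remove_object(self, name):
        """Update the graph after a manipulation moves an object out of the scene."""
        self.nodes.discard(name)
        self.edges = defaultdict(set, {
            (s, r): {o for o in objs if o != name}
            for (s, r), objs in self.edges.items() if s != name
        })

    def query(self, relation, obj):
        """Return all subjects standing in `relation` to `obj` in the final graph."""
        return {s for (s, r), objs in self.edges.items()
                if r == relation and obj in objs}

# Schematic active-exploration loop: update the graph after each action,
# then answer the question by searching the final graph.
g = DynamicSceneGraph()
g.add_relation("box", "on", "table")
g.add_relation("cup", "behind", "box")
g.remove_object("box")                  # robot removes the occluding box
g.add_relation("cup", "on", "table")    # newly revealed object is observed
print(sorted(g.query("on", "table")))   # → ['cup']
```

The point of the dynamic update is that relations involving a manipulated object are invalidated, so the final graph reflects only what remains observable after exploration.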