Interactive Mobile App Navigation with Uncertain or Under-specified Natural Language Commands

Andrea Burns 1   Deniz Arsan 2   Sanjna Agrawal 1   Ranjitha Kumar 2   Kate Saenko 1,3   Bryan A. Plummer 1
1 Boston University, MA   2 University of Illinois at Urbana-Champaign, IL   3 MIT-IBM Watson AI Lab, MA
{aburns4, sanjna, saenko, bplum}@bu.edu   {darsan2, ranjitha}@illinois.edu

Abstract

We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a new dataset where the goal is to complete a natural language query in a mobile app. Current datasets for related tasks in interactive question answering, visual common sense reasoning, and question-answer plausibility prediction do not support research in resolving ambiguous natural language requests or operating in diverse digital domains. As a result, they fail to capture the complexities of real question answering or interactive tasks. In contrast, MoTIF contains natural language requests that are not satisfiable, the first such work to investigate this issue for interactive vision-language tasks. MoTIF also contains follow-up questions for ambiguous queries to enable research on task uncertainty resolution. We introduce task feasibility prediction and propose an initial model which obtains an F1 score of 61.1. We next benchmark task automation with our dataset and find that adaptations of prior work perform poorly due to our realistic language requests, obtaining an accuracy of only 20.2% when mapping commands to grounded actions. We analyze performance and gain insight for future work that may bridge the gap between current model ability and what is needed for successful use in applications.

1. Introduction

Vision-language tasks often require high-level reasoning skills like counting, comparison, and common-sense knowledge to relate visual and language data [6, 10, 11, 13, 34].
The goal of these tasks has been to enable reliable and robust vision-language reasoning, but prior work has failed to create AI agents that can interact with humans naturally and handle realistic use cases. For example, vision-language models fail to recognize when a visual question cannot be answered given the scene being viewed. Instead, these models provide visually unrelated, yet plausible answers, like answering 'white' to the question 'what color is the remote control?' when no remote is present in the input image [23]. This is particularly dangerous for users who are limited in their ability to determine if an answer is trustworthy, either physically or situationally, e.g., users who are low-vision or driving. While prior work has explored question relevance for text-only [10] and visual question answering [27], it has focused on the extreme case of language being completely unrelated to the visual input it is paired with. VizWiz [13] introduced a visual question answering dataset for images taken by people who are blind, resulting in questions which may not be answerable from the captured im-

Figure 1. Example MoTIF tasks. Task inputs are free-form natural language commands which may not be possible in the app environment. At each time step, a task demonstration consists of action coordinates (i.e., where clicking, typing, or scrolling occurs) and the app screen and view hierarchy (illustrated behind it).

arXiv:2202.02312v1 [cs.CL] 4 Feb 2022
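The figure caption describes each demonstration time step as an action with screen coordinates, paired with the app screen and its view hierarchy. As a rough illustration only, the following sketch shows one way such records could be represented; all field and class names here are our own and do not reflect the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical representation of one demonstration time step.
# Field names are illustrative, not MoTIF's actual schema.
@dataclass
class DemoStep:
    action: str                       # "click", "type", or "scroll"
    x: float                          # normalized x-coordinate of the action
    y: float                          # normalized y-coordinate of the action
    screenshot_path: str              # app screen captured at this step
    view_hierarchy: dict = field(default_factory=dict)  # UI element tree
    typed_text: Optional[str] = None  # only populated for "type" actions

# A demonstration pairs the natural language query (and its feasibility
# label) with an ordered sequence of steps.
@dataclass
class Demonstration:
    query: str
    feasible: bool
    steps: list = field(default_factory=list)

demo = Demonstration(query="open accessibility settings", feasible=True)
demo.steps.append(DemoStep(action="click", x=0.5, y=0.7,
                           screenshot_path="step0.png"))
print(len(demo.steps))  # → 1
```

This framing makes the two benchmark tasks explicit: feasibility prediction maps a `query` (plus app observations) to the `feasible` label, while task automation maps the query to the grounded `steps` sequence.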