TWIZ: The Multimodal Conversational Task Wizard

Rafael Ferreira, Diogo Silva, Diogo Tavares, Frederico Vicente, Mariana Bonito, Gustavo Gonçalves, Rui Margarido, Paula Figueiredo, Helder Rodrigues, David Semedo, Joao Magalhaes
Universidade NOVA de Lisboa, Lisbon, Portugal
{rah.ferreira, dmgc.silva, dc.tavares, fm.vicente, m.bonito, gs.goncalves, rp.margarido, pc.mestre, harr}@campus.fct.unl.pt, {df.semedo, jm.magalhaes}@fct.unl.pt

Figure 1: (a) Task Grounding; (b) Task Overview; (c) Step Instructions.

ABSTRACT
This paper introduces TWIZ, a multimodal conversational task wizard that supports an engaging experience, where users are guided through a multimodal conversation towards the successful completion of recipes and DIY tasks. TWIZ leverages task guides from WikiHow and recipe sources, as well as dialog AI-based methods, to deliver a rich, compelling, and engaging experience when guiding users through complex manual tasks. TWIZ participated in the Amazon Alexa Prize Taskbot 2021 [3]. Demo Video.

CCS CONCEPTS
• Computing methodologies → Natural language processing; Intelligent agents; • Human-centered computing → Interactive systems and tools.

KEYWORDS
Conversation Assistants, Multimodal, Artificial Intelligence, Natural Language Processing

ACM Reference Format:
Rafael Ferreira, Diogo Silva, Diogo Tavares, Frederico Vicente, Mariana Bonito, Gustavo Gonçalves, Rui Margarido, Paula Figueiredo, Helder Rodrigues, David Semedo, and Joao Magalhaes. 2022. TWIZ: The Multimodal Conversational Task Wizard.
In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3503161.3547741

© 2022 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9203-7/22/10.

1 MULTIMODAL TASK GUIDING AGENTS
Research in multimodal conversational AI is currently tackling the problem of complex manual tasks such as cooking and DIY. In this novel setting, task complexity needs to be managed and delivered in a controlled manner to effectively assist the user. The core idea is to guide users in an engaging manner by distributing cognitive load throughout the task, thus seeking a symbiosis between task completion and a multimodal curiosity-exploration paradigm. The key novelties and application design principles are:

- Conversational task-grounding. A robust intent-detection and retrieval component, tailored for voice-based conversational agents, to quickly find the desired task and minimize user frustration - Figure 1 (a).
- Task organization and presentation. Task steps are decomposed and automatically illustrated with images, improving clarity and providing instructional visual cues. When available, videos complement the task instructions - Figure 1 (c).
- Conversational engagement. Novel features keep users engaged throughout the conversation: a 3D visual preview of tasks, task-specific curiosities delivered through dialog-system initiative, and support for contextual question answering.
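To make the task-grounding step concrete, the sketch below shows the two-stage pattern described above: detect the user's intent, then ground a task request against a task catalog. This is an illustrative toy, not the TWIZ implementation; the rule-based intent detector, the lexical-overlap retriever, and the mini-catalog are all stand-ins for the learned components and the WikiHow/recipe task guides.

```python
# Illustrative sketch (not the TWIZ system): intent detection followed
# by task grounding over a catalog of task guides.
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    steps: list = field(default_factory=list)


# Hypothetical mini-catalog standing in for WikiHow/recipe task guides.
CATALOG = [
    Task("Bake chocolate chip cookies", ["Preheat oven", "Mix dough", "Bake 12 min"]),
    Task("Fix a leaky faucet", ["Shut off water", "Replace washer", "Reassemble"]),
]


def detect_intent(utterance: str) -> str:
    """Toy rule-based intent detector; a real agent would use a learned model."""
    u = utterance.lower()
    if any(w in u for w in ("next", "repeat", "previous")):
        return "navigation"
    if u.endswith("?"):
        return "question"
    return "task_request"


def retrieve_task(utterance: str) -> Task:
    """Naive lexical-overlap retrieval over task titles."""
    words = set(utterance.lower().split())
    return max(CATALOG, key=lambda t: len(words & set(t.title.lower().split())))


intent = detect_intent("I want to bake cookies")
task = retrieve_task("I want to bake cookies")
print(intent, "->", task.title)  # task_request -> Bake chocolate chip cookies
```

In a voice-first setting, the point of this two-stage routing is to minimize user frustration: navigation and question turns are handled locally, and only genuine task requests trigger retrieval.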
2 CONVERSATIONAL SYSTEM DESIGN
Our architecture is presented in Figure 2. It is based on the CoBot framework [7] provided by Amazon. This framework runs the main flow of the agent in a server-less cloud function and uses a database to save the dialog turns and user state. The ML algorithms are run
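Because a server-less function keeps no memory between invocations, each turn must restore the user's dialog state from the database, advance the conversation, and persist the updated turns before returning. A minimal sketch of that loop, assuming a dict-backed store and illustrative names (this is not the CoBot API):

```python
# Minimal sketch of a stateless turn handler, as in a server-less cloud
# function: state is loaded from a store, updated, and saved each turn.
# STATE_STORE stands in for the dialog-state database.
STATE_STORE: dict = {}


def handle_turn(user_id: str, utterance: str) -> str:
    # Restore per-user state; a new user starts at step 0 with no turns.
    state = STATE_STORE.get(user_id, {"turns": [], "step": 0})
    if utterance.lower() == "next":
        state["step"] += 1  # advance through the task's step instructions
    response = f"Step {state['step']}"
    state["turns"].append((utterance, response))
    STATE_STORE[user_id] = state  # persist before the function exits
    return response


handle_turn("u1", "start")  # returns "Step 0"
handle_turn("u1", "next")   # returns "Step 1"
```

Keeping all state in the database is what lets the agent's main flow scale horizontally: any function instance can serve any user's next turn.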