TWIZ: The Multimodal Conversational Task Wizard
Rafael Ferreira
Diogo Silva
Diogo Tavares
Frederico Vicente
rah.ferreira@campus.fct.unl.pt
dmgc.silva@campus.fct.unl.pt
dc.tavares@campus.fct.unl.pt
fm.vicente@campus.fct.unl.pt
Universidade NOVA de Lisboa
Lisbon, Portugal
Mariana Bonito
Gustavo Gonçalves
Rui Margarido
Paula Figueiredo
m.bonito@campus.fct.unl.pt
gs.goncalves@campus.fct.unl.pt
rp.margarido@campus.fct.unl.pt
pc.mestre@campus.fct.unl.pt
Universidade NOVA de Lisboa
Lisbon, Portugal
Helder Rodrigues
David Semedo
Joao Magalhaes
harr@campus.fct.unl.pt
df.semedo@fct.unl.pt
jm.magalhaes@fct.unl.pt
Universidade NOVA de Lisboa
Lisbon, Portugal
Figure 1: (a) Task Grounding; (b) Task Overview; (c) Step Instructions.
ABSTRACT
This paper introduces TWIZ, a multimodal conversational task wizard that supports an engaging experience, where users are guided through a multimodal conversation towards the successful completion of recipes and DIY tasks. TWIZ leverages task guides from WikiHow and recipe sources, as well as AI-based dialog methods, to deliver a rich, compelling, and engaging experience when guiding users through complex manual tasks. TWIZ participated in the Amazon Alexa Prize Taskbot 2021 [3]. Demo Video.
CCS CONCEPTS
· Computing methodologies → Natural language processing;
Intelligent agents; · Human-centered computing → Interactive systems and tools.
KEYWORDS
Conversation Assistants, Multimodal, Artificial Intelligence, Natural Language Processing
ACM Reference Format:
Rafael Ferreira, Diogo Silva, Diogo Tavares, Frederico Vicente, Mariana Bonito, Gustavo Gonçalves, Rui Margarido, Paula Figueiredo, Helder Rodrigues, David Semedo, and Joao Magalhaes. 2022. TWIZ: The Multimodal Conversational Task Wizard. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3503161.3547741

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
MM ’22, October 10–14, 2022, Lisboa, Portugal
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9203-7/22/10.
https://doi.org/10.1145/3503161.3547741
1 MULTIMODAL TASK GUIDING AGENTS
Research in multimodal conversational AI is currently tackling the problem of complex manual tasks, such as cooking and DIY. In this novel setting, task complexity needs to be managed and delivered in a controlled manner to effectively assist the user. The core idea is to guide users in an engaging manner by distributing cognitive load throughout the task, thus seeking a symbiosis between task completion and a multimodal curiosity-exploration paradigm. The key novelties and application design principles are:
Conversational task-grounding. A robust intent detection and retrieval component, tailored for voice-based conversational agents, to quickly find the desired task and minimize user frustration - Figure 1 (a);
Task organization and presentation. Task steps are decomposed and automatically illustrated with images, thus improving clarity and providing instructional visual cues. When available, we use videos to complement task instructions - Figure 1 (c);
Conversational engagement. Novel features aimed at keeping users engaged throughout the conversation: 3D visual task previews, task-specific curiosities delivered through dialog system initiative, and support for contextual question-answering.
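The task-grounding principle above (intent detection followed by task retrieval) can be sketched as a minimal pipeline. The rule-based intent detector, the toy task catalog, and the bag-of-words ranking below are illustrative placeholders, not TWIZ's actual trained components:

```python
# Hypothetical sketch of conversational task grounding: first classify the
# user's intent, then retrieve the best-matching task guide by title overlap.
from collections import Counter
import math

# Toy catalog of task-guide titles (illustrative only).
TASK_CATALOG = [
    "how to bake chocolate chip cookies",
    "how to fix a leaky faucet",
    "how to make vegetable soup",
]

def detect_intent(utterance: str) -> str:
    """Tiny rule-based intent detector (stand-in for a trained classifier)."""
    text = utterance.lower()
    if any(w in text for w in ("cook", "bake", "recipe", "fix", "make")):
        return "StartTaskIntent"
    if any(w in text for w in ("stop", "cancel", "exit")):
        return "StopIntent"
    return "FallbackIntent"

def retrieve_task(query: str) -> str:
    """Rank catalog titles by cosine similarity over bag-of-words counts."""
    def vec(s: str) -> Counter:
        return Counter(s.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(query)
    return max(TASK_CATALOG, key=lambda title: cosine(q, vec(title)))

if __name__ == "__main__":
    utterance = "I want to bake cookies"
    if detect_intent(utterance) == "StartTaskIntent":
        print(retrieve_task(utterance))  # grounds to the cookie recipe
```

In a voice setting, a real system would replace both placeholders with learned models robust to ASR noise; the control flow of grounding an utterance to a concrete task guide stays the same.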
2 CONVERSATIONAL SYSTEM DESIGN
Our architecture is presented in Figure 2. It is based on the CoBot framework [7] provided by Amazon. This framework runs the main flow of the agent on a serverless cloud function and uses a database to save the dialog turns and user state. The ML algorithms are run