Overview of the seventh Dialog System Technology Challenge: DSTC7

Luis Fernando D'Haro*,a, Koichiro Yoshino b, Chiori Hori c, Tim K. Marks c, Lazaros Polymenakos d, Jonathan K. Kummerfeld e, Michel Galley f, Xiang Gao f

a Speech Technology Group, Information Processing and Telecommunications Center (IPTC), ETSI Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria, Av. Complutense, 30, Madrid 28040, Spain
b Nara Institute of Science and Technology, Ikoma, Nara, 630-0192, Japan
c Mitsubishi Electric Research Laboratories (MERL), 201 Broadway, Cambridge, MA 02139, USA
d Alexa Dialog Science, 101 Main Street, Cambridge, MA 02142, USA
e University of Michigan, 2260 Hayward Street, Ann Arbor, MI 48109, USA
f Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

ARTICLE INFO
Article History: Received 30 July 2019; Accepted 2 January 2020; Available online 15 January 2020

ABSTRACT
This paper provides detailed information about the seventh Dialog System Technology Challenge (DSTC7) and its three tracks, which explore the problem of building robust and accurate end-to-end dialog systems. Specifically, DSTC7 focuses on developing and exploring end-to-end technologies for three pragmatic challenges: (1) sentence selection for multiple domains, (2) generation of informational responses grounded in external knowledge, and (3) audio-visual scene-aware dialog, which allows conversations with users about objects and events around them.

This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the tracks, the provided datasets and annotations, an overview of the submitted systems, and their final results. For Track 1, LSTM-based models performed best across both datasets, allowing teams to effectively handle task variants in which no correct answer was present or in which multiple paraphrases were included.
For Track 2, the best results were obtained by RNN-based architectures augmented to incorporate facts through two types of encoders (a dialog encoder and a fact encoder), combined with attention mechanisms and a pointer-generator approach. Finally, for Track 3, the best model used hierarchical attention mechanisms to combine the text and vision information, obtaining a human rating score 22% better than that of the baseline LSTM system.

More than 220 participants registered, and about 40 teams participated in the final challenge. Thirty-two scientific papers reporting the systems submitted to DSTC7, along with three general technical papers on dialog technologies, were presented during the one-day wrap-up workshop at AAAI-19. During the workshop, we reviewed the state-of-the-art systems, shared novel approaches to the DSTC7 tasks, and discussed future directions for the challenge (DSTC8).

© 2020 Elsevier Ltd. All rights reserved.

Keywords: Dialog System Technology Challenge; end-to-end dialog systems; Sentence Selection; Natural Language Generation; Audio Visual Scene-Aware Dialog

All authors contributed equally. http://workshop.colips.org/dstc7.
*Corresponding author. E-mail address: luisfernando.dharo@upm.es (L.F. D'Haro).
https://doi.org/10.1016/j.csl.2020.101068
Computer Speech & Language 62 (2020) 101068