Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

Bernd Bohnet* Vinh Q. Tran* Pat Verga Roee Aharoni Daniel Andor Livio Baldini Soares Massimiliano Ciaramita Jacob Eisenstein Kuzman Ganchev Jonathan Herzig Kai Hui Tom Kwiatkowski Ji Ma Jianmo Ni Lierni Sestorain Saralegui Tal Schuster William W. Cohen Michael Collins Dipanjan Das Donald Metzler Slav Petrov Kellie Webster

Google Research

Abstract

Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial in this setting. We formulate and study Attributed QA as a key first step in the development of attributed LLMs. We propose a reproducible evaluation framework for the task and benchmark a broad set of architectures. We take human annotations as a gold standard and show that a correlated automatic metric is suitable for development.¹ Our experimental work gives concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and gives some hints as to how to address a third (How to build LLMs with attribution?).

¹ We publicly release all system responses and their human and automatic ratings, at https://github.com/google-research-datasets/Attributed-QA

* Equal contribution. Final author.

1 Introduction

Large language models (LLMs) have shown impressive results across a variety of natural language understanding and generation tasks (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Rae et al., 2021; Zhang et al., 2022; Chowdhery et al., 2022; Chung et al., 2022) while requiring little or no direct supervision,² instead using few-shot (Brown et al., 2020) or in-context learning (Xie
et al., 2021). There is increasing evidence that LLMs may have potential in information-seeking scenarios, producing compelling output in scenarios ranging from “simple” question answering (e.g., Kwiatkowski et al. (2019); Rajpurkar et al. (2016); Joshi et al. (2017)), to long-form question answering (Amplayo et al., 2022; Stelmakh et al., 2022), and information-seeking dialog (Thoppilan et al., 2022; Glaese et al., 2022; Shuster et al., 2022; Nakano et al., 2021).

² By “direct supervision” we refer to labeled examples for the specific task in mind, for example datasets such as the Natural Questions corpus (Kwiatkowski et al., 2019) for question answering. We use the term “direct supervision” to distinguish this form of supervision from the term “self-supervision” sometimes used in the context of LLMs.

System Input
Question: what is the order of the netflix marvel shows?

System Output
Answer: Daredevil, Jessica Jones, Luke Cage, Iron Fist, The Defenders, The Punisher
Attribution: A deal between Marvel and Netflix to produce several interconnected series was announced in November 2013, with the individual series Daredevil (2015–2018), Jessica Jones (2015–2019), Luke Cage (2016–2018), and Iron Fist (2017–2018) culminating in the crossover miniseries The Defenders (2017). A spin-off from Daredevil, The Punisher (2017–2019), was ordered in April 2016. The series were all filmed in New York State, forming the state’s largest television production commitment with 161 episodes between them. [https://en.wikipedia.org/wiki/Marvel's_Netflix_television_series]

Figure 1: In attributed question answering, the input to the model is a question, and the output from the model is an answer string together with a pointer to a short segment of text that supports that answer.
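The task format illustrated in Figure 1 (a question in; an answer string plus an attribution out) can be captured by a small record type. The following Python sketch and its field names are our own illustration, not part of the paper's evaluation framework:

```python
from dataclasses import dataclass

@dataclass
class AttributedAnswer:
    """An answer string plus a pointer to a short supporting passage."""
    answer: str    # the short answer returned to the user
    passage: str   # segment of text that supports the answer
    url: str       # source document containing the passage

# The Figure 1 example encoded in this format (passage abbreviated).
example = AttributedAnswer(
    answer=("Daredevil, Jessica Jones, Luke Cage, Iron Fist, "
            "The Defenders, The Punisher"),
    passage=("A deal between Marvel and Netflix to produce several "
             "interconnected series was announced in November 2013, ..."),
    url="https://en.wikipedia.org/wiki/Marvel's_Netflix_television_series",
)
print(example.answer)
```

Any system that emits such (answer, passage, URL) triples can then be scored by the evaluation framework described below, regardless of its internal architecture.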
This lack of direct supervision is particularly appealing given the difficulties of constructing labeled datasets for even simple question answering,³ let alone more complex (but

³ Here we are referring to the traditional approach to data collection for supervised learning, where human raters provide

arXiv:2212.08037v2 [cs.CL] 10 Feb 2023