Making Test Corpora for Question Answering More Representative

Andrew Walker¹, Andrew Starkey², Jeff Z. Pan¹, and Advaith Siddharthan¹
¹ Computing Science, University of Aberdeen, UK [r05aw0,jeff.z.pan,advaith]@abdn.ac.uk
² Engineering, University of Aberdeen, UK [a.starkey]@abdn.ac.uk

Abstract. Despite two high-profile series of challenges devoted to question answering technologies, there remains no formal study of how representative question corpora are of real end-user inputs. We examine the corpora used presently and historically in the TREC and QALD challenges in juxtaposition with two more from natural sources and identify a degree of disjointedness between the two. We analyse these differences in depth before discussing a candidate approach to question corpus generation and assessing its own representativeness. We conclude that these artificial corpora have good overall coverage of grammatical structures but that the distribution is skewed, meaning performance measures may be inaccurate.

1 Introduction

Question Answering (QA) technologies were envisioned early on in the artificial intelligence community. At least 15 experimental English language QA systems were described by [13]. Notable early attempts include BASEBALL [11] and LUNAR [17,18]. New technologies and resources often prompt a new wave of QA solutions using them. For example: relational databases [8] with PLANES [16]; the semantic web [2] by [3]; and Wikipedia [15] by [7].

Attempts to evaluate QA technologies are similarly diverse. The long-running Text REtrieval Conferences³ (TREC) make use of human assessors in conjunction with a nugget pyramid method [12], while the newer Question Answering over Linked Data⁴ (QALD) series uses an automated process that compares results with a gold standard.

³ http://trec.nist.gov/
⁴ http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

In both cases, however, neither series addresses whether the questions posed to the challenge participants actually capture the range and diversity of questions that real users would ask of a QA system. We explore the distribution of grammatical relationships present in various artificial and natural question corpora in two primary aspects: coverage and representativeness. Coverage is important for QA solution developers to gauge gaps in their