A metadata infrastructure for the analysis of parliamentary proceedings

Richard Gartner
Centre for e-Research, Department of Digital Humanities
King's College London
London, United Kingdom
richard.gartner@kcl.ac.uk

Abstract—This work-in-progress article discusses DILIPAD (Digging into Linked Parliamentary Data), a project funded under the Digging Into Data Challenge. DILIPAD aims to create an extensive corpus of structured XML data of parliamentary proceedings from three countries (United Kingdom, Netherlands and Canada) in order to enable large-scale diachronic analyses of their content. The corpora integrate the textual data of proceedings within contextual metadata encoded in the XML schema Parliamentary Metadata Language (PML). The article discusses the background to the project and the construction of the corpora, and highlights the ways in which they may be used for quantitative and qualitative analysis.

Keywords—metadata; corpus analysis; parliamentary history; XML

I. INTRODUCTION

Although much of the data which first brought the concept of Big Data to attention originated in the sciences, its applicability to the humanities is currently being explored in greater depth than previously. The concept itself is subject to a variety of definitions, but most tend to agree on the distinctive features highlighted by Ward and Baker [1, p. 2]: size, complexity and technology (the last being the development and use of tools capable of processing large, complex datasets). Although the datasets in humanities Big Data projects are often smaller than those originating in the sciences [2, p. 462], their complexity and the need for new tools and techniques to handle it are just as demanding.
Meeting these challenges is the main rationale behind the "Digging into Data Challenge" [3]. This "challenge", an open competition for innovative projects in large-scale data analysis in the humanities and social sciences, has funded the DILIPAD (Digging into Linked Parliamentary Data) [4] project which is the subject of this work-in-progress paper. This project aims to develop new methodologies for the qualitative analysis of large volumes of records of legislative proceedings in three countries (United Kingdom, Canada and the Netherlands). It is attempting to do so by producing structured text corpora from these records and devising techniques, analogous to those already used in corpus linguistics, to analyse these across temporally and geographically diverse ranges of data.

II. BACKGROUND

A. Digitizing Parliamentary Data

A large corpus of parliamentary proceedings has been available in all three countries covered by the DILIPAD project for some years. In the United Kingdom, the full text of proceedings has been digitized from 1803 onwards in a number of notable projects, including British History Online [5] and the UK Parliament's own conversion of the Hansard record of parliamentary debates [6]. In the Netherlands, two hundred years of proceedings have been digitized [7, p. 3671] as part of DutchParl, a corpus of Dutch-language proceedings [7]. In Canada, scanned images of proceedings date as far back as 1867 [7] although machine-readable texts of these only go back to the 1990s. These projects have produced large collections of machine-readable texts, but the texts conform to a diverse set of encoding standards, which makes large-scale analysis of their contents difficult to achieve. The UK's Hansard project, for instance, uses XML files conforming to in-house schemata which undergo several revisions over the period of coverage.
The Netherlands data conform to an in-project schema, Politicalmashup [9], while the Canadian data conform to TxtMap, a schema devised in-house to represent the text of single page images with limited semantic content [9]. Other projects, such as a recent project to scan six years of proceedings from the Estonian parliament, use the TEI (Text Encoding Initiative) [10]. This heterogeneity of approaches to encoding limits the analytical potential of each collection and severely curtails their potential for cross-collection analysis. Generic bibliographic metadata schemas, such as the TEI Header or the more sophisticated MODS (Metadata Object Description Schema), do not offer sufficiently specific semantics to describe these proceedings adequately. Of the bespoke XML applications used by the projects noted above, only the Politicalmashup schema is devised specifically for textual analysis, and this is itself limited to a narrow range (specifically the structure of proceedings in terms of speeches, interruptions, interventions etc.). More sophisticated querying of the record, and the qualitative analyses that depend on it, require a new, more generic schema.
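
The practical cost of this heterogeneity can be illustrated with a minimal sketch. It assumes two invented encodings of the same speech, one semantically tagged and one page-oriented; the element and attribute names are hypothetical and are not taken from the Hansard, Politicalmashup or TxtMap schemas. Each encoding needs its own reader before the two can be compared or queried together.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same speech. The element and attribute
# names are invented for illustration and do not come from any real schema.
SOURCE_A = """<debate date="1901-03-04">
  <speech speaker="Mr. Example"><p>I beg to move the motion.</p></speech>
</debate>"""

SOURCE_B = """<page>
  <line type="date">1901-03-04</line>
  <line type="speaker">Mr. Example</line>
  <line type="text">I beg to move the motion.</line>
</page>"""


def speeches_from_a(xml_text):
    """Read the semantically tagged form: one element per speech."""
    root = ET.fromstring(xml_text)
    return [
        {
            "date": root.get("date"),
            "speaker": speech.get("speaker"),
            "text": " ".join((p.text or "").strip() for p in speech.findall("p")),
        }
        for speech in root.findall("speech")
    ]


def speeches_from_b(xml_text):
    """Read the page-oriented form: speeches are reassembled from typed lines."""
    root = ET.fromstring(xml_text)
    date, records = None, []
    for line in root.findall("line"):
        kind, value = line.get("type"), (line.text or "").strip()
        if kind == "date":
            date = value
        elif kind == "speaker":
            # A speaker line opens a new speech record.
            records.append({"date": date, "speaker": value, "text": ""})
        elif kind == "text" and records:
            # Text lines are appended to the most recent speech.
            records[-1]["text"] = (records[-1]["text"] + " " + value).strip()
    return records


records_a = speeches_from_a(SOURCE_A)
records_b = speeches_from_b(SOURCE_B)
```

Both readers yield the same record for this toy input, but only because a separate adapter was written for each source. A shared schema would remove the need for such per-collection code.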