Building worksets for scholarship by linking complementary corpora

Kevin Page, kevin.page@oerc.ox.ac.uk, University of Oxford, United Kingdom
Terhi Nurmikko-Fuller, terhi.nurmikko-fuller@anu.edu.au, University of Oxford, United Kingdom
Timothy Cole, t-cole3@illinois.edu, University of Illinois, United States of America
J. Stephen Downie, jdownie@illinois.edu, University of Illinois, United States of America

Background and General Motivation

The HathiTrust Digital Library

The HathiTrust Digital Library (HTDL) comprises digitized representations of 15.1 million volumes: approximately 7.47 million book titles, 418,216 serial titles, and 5.3 billion pages, across 460 languages. The HTDL is best described as “a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future”.

The HathiTrust Research Center (HTRC) develops software models, tools, and infrastructure to help digital humanities (DH) scholars conduct new computational analyses of works in the HTDL. For many scholars the size of the HTDL corpus is both attractive and daunting: many existing DH tools are designed for smaller collections, and many research inquiries are better served by more focused, homogeneous collections of texts (Gibbs and Owens, 2012).

Worksets

In many, if not most, DH research endeavours, performing an analytical task across the whole HTDL is neither practical nor productive (Kambatla et al., 2014). For example, a tool trained to identify genre attributes of 18th-century English-language prose fiction may not be applicable to 20th-century French poetry. The first step is therefore to identify a subset -- of works, editions, volumes, chapters, or pages -- that sets an initial investigative scope, and then to iteratively refine that subset as research proceeds. In a corpus as large and complex as the HTDL, finding materials and then defining the sought-after subset can be extraordinarily difficult.
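The subset-scoping step described above can be sketched as a simple filter over catalogue metadata. The record fields, function name, and selection criteria below are illustrative assumptions for the sketch, not the HTRC's actual data model or API:

```python
# A minimal sketch of scoping an initial subset from catalogue metadata.
# The field names and criteria are hypothetical, chosen only to illustrate
# narrowing a large corpus to a focused, homogeneous collection.

def select_subset(records, language, year_range):
    """Return the records matching a language and publication-year range."""
    start, end = year_range
    return [r for r in records
            if r["language"] == language and start <= r["year"] <= end]

# A toy catalogue standing in for bibliographic metadata records.
catalogue = [
    {"id": "vol1", "language": "eng", "year": 1755, "genre": "fiction"},
    {"id": "vol2", "language": "fre", "year": 1921, "genre": "poetry"},
    {"id": "vol3", "language": "eng", "year": 1789, "genre": "fiction"},
]

# Scope the investigation to 18th-century English-language volumes.
subset = select_subset(catalogue, "eng", (1700, 1799))
print([r["id"] for r in subset])  # → ['vol1', 'vol3']
```

In practice such criteria would be applied iteratively, tightening or broadening the selection as the research question evolves.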
The HTRC has come to call a collection of digital items brought together by a scholar for their analyses a “workset”, created to help the researcher build, manipulate, iteratively define, and compare their collections. Reflecting upon input and advice from the DH community, Jett (2015) defines a workset as a machine-actionable research collection realised as:

1. An aggregation of members (volumes, pages, etc.);
2. Metadata intrinsic to the workset’s essential nature (e.g., creator, selection criteria);
3. Metadata intrinsic to digital architectures (e.g., creation date and number of members);
4. Metadata supportive of human interactions (e.g., title and description);
5. Derivative metadata from workset members (e.g., format(s), language(s), etc.); and,
6. Metadata concerning workset provenance (e.g., derived from, used by, etc.).

Broadly, item 1 identifies the actual data used in an analysis, whereas the remaining metadata items describe the workset itself, aiding workset management throughout the research cycle.

Cross-corpus worksets

As alluded to above, numerous criteria can be used to select the constituents of a workset, and several technological implementations could, in theory, realise worksets. In researching the design and realisation of worksets and associated tooling, we are also mindful to remain grounded in their practical application and the needs of scholarly users. We have therefore undertaken our work through discipline-based scenarios in which we can explore the strengths and weaknesses of the HTDL viewed through the prism of worksets. We report one such exploration here, asking whether (relatively) small, well-explored, and well-understood corpora can be superimposed over the HTDL to aid navigation and investigation of the much larger and more superficially understood HTDL collection. From a system perspective, a cross-corpus workset requires exposing compatible metadata (items 2-6 above) from multiple collections, first used to align