The Graphic Narrative Corpus (GNC): Design, Annotation, and Analysis for the Digital Humanities

Alexander Dunst
Dept. of English and American Studies
University of Paderborn
Paderborn, Germany
dunst@mail.uni-paderborn.de

Rita Hartel
Dept. of Computer Science
University of Paderborn
Paderborn, Germany
rst@uni-paderborn.de

Jochen Laubrock
Dept. of Psychology
University of Potsdam
Potsdam, Germany
laubrock@uni-potsdam.de

Abstract: Developed for an interdisciplinary DH project, the Graphic Narrative Corpus (GNC) is the first digital corpus of graphic novels, memoirs, and non-fiction written in English. It currently includes 160 book-length titles and will grow to around 250 graphic narratives by 2018. In contrast to collections such as Manga109, the eBDtheque, and the Iyyer corpus, the GNC was conceived to serve both the research needs of humanities and social science scholars and as a data set for computational analysis. The GNC was constructed as a stratified monitor corpus that balances different historical periods, geographical origin, literary genres, and the gender and ethnic background of authors. Based on an extension of John Walsh’s XML-dialect CBML and editor software developed for the corpus, annotation combines a focus on the first ten pages of each title and sample annotation of full-length books. XML-annotation currently includes visual objects, as well as word-image and character relations (panels, characters, balloons, captions, text, interaction types). In addition, we also provide eye-tracking data for annotated titles. Information about the corpus and sample visualizations can be found at: https://groups.uni-paderborn.de/graphic-literature/gncorpus/corpus.php.

Keywords: Corpus, Graphic Novels, Annotation, Eye Tracking

I. INTRODUCTION

Research on comics has undergone sustained growth over the last two decades in several disciplines and has now become a highly diverse field of inquiry.
Although there are wordless and abstract comics, the medium’s complex combination of words and images in telling stories has drawn the most sustained interest.

Recent advances in image analysis and the explosive growth of the digital humanities (DH) mean that considerable efforts are underway to advance the computational analysis of comics. Several corpora are now available, including the Manga109 data set [1], the Iyyer corpus [2], and the eBDtheque [3]. These corpora focus on different formats and national traditions: Japanese manga in the first case, US-American comic books from the so-called Golden Age of the 1930s and ’40s in the second, and a mixture of these, plus French bandes dessinées, in the third. They also vary widely in size. While the eBDtheque only runs to 100 pages, the Manga109 corpus contains roughly that many entire comic books, with a total of 21,000 pages [3] [1]. In contrast, the Iyyer corpus “contains ~1.2 million panels drawn from almost 4,000 publicly available comic books” [2].

Despite their considerable differences, these corpora have one thing in common: all were assembled by computer scientists. Every disciplinary perspective leads researchers to make certain choices and to exclude others, and computer science is no exception. As a consequence, existing comics corpora may prove extremely valuable for digital humanists in some cases but also carry distinct disadvantages for research that brings together computational and humanistic or social science aims. Section II describes how comics corpora can be designed to appeal to researchers in both of these fields. Such a broader audience may not be necessary, or even advantageous, in all cases.
Yet, as computational research on comics looks to tackle more complex aspects of the medium – including visual object recognition, text-image relations, and scene understanding – the media-specific knowledge accrued over decades of study in the humanities becomes of increasing value to computer science. Rule-based approaches to tasks such as panel and text recognition may also benefit from the descriptive scholarship of humanities research.

Sections III and IV introduce the XML-annotation language and visual editor developed for the GNC. The latter section also describes the editor’s general design, automatic features, and computer-aided annotation tools. Section V then provides details of eye-tracking data recorded for our corpus, before the final two sections present brief overviews of potential applications across a number of disciplines and discuss future work as we look to expand our corpus and annotations.

II. CORPUS DESIGN

The basis for any valid statement about a cultural format, including comics, is a balanced and representative selection of texts that has been collected according to a transparent sampling regime. This means that the texts included in a stratified corpus should reflect the numbers and different types of text that exist in a given format, be it manga, graphic narratives, or Franco-Belgian comics. To achieve this aim, collection must be based on a clear definition of what will be included and what will not. In practice, this amounts to defining what the researchers involved in the corpus design understand as the central characteristics of a certain type of text. To use the example of the GNC: Graphic narratives represent one of the most popular, and culturally the most prestigious, forms of comics production in North America, much of Europe, and Latin America.
Authors such as Art Spiegelman, Chris Ware, Marjane Satrapi, and Alison Bechdel are recognized as major artists, and their works are regularly taught at schools and universities, or adapted into successful plays or films.

2017 14th IAPR International Conference on Document Analysis and Recognition 2379-2140/17 $31.00 © 2017 IEEE DOI 10.1109/ICDAR.2017.286 15

For the