The Graphic Narrative Corpus (GNC): Design,
Annotation, and Analysis for the Digital Humanities
Alexander Dunst
Dept. of English and American Studies
University of Paderborn
Paderborn, Germany
dunst@mail.uni-paderborn.de
Rita Hartel
Dept. of Computer Science
University of Paderborn
Paderborn, Germany
rst@uni-paderborn.de
Jochen Laubrock
Dept. of Psychology
University of Potsdam
Potsdam, Germany
laubrock@uni-potsdam.de
Abstract—Developed for an interdisciplinary DH project, the
Graphic Narrative Corpus (GNC) is the first digital corpus of
graphic novels, memoirs, and non-fiction written in English. It
currently includes 160 book-length titles and will grow to around
250 graphic narratives by 2018. In contrast to collections such as
Manga109, the eBDtheque, and the Iyyer corpus, the GNC was
conceived to serve both the research needs of humanities and
social science scholars and as a data set for computational
analysis. The GNC was constructed as a stratified monitor
corpus that balances different historical periods, geographical
origin, literary genres, and the gender and ethnic background of
authors. Based on an extension of John Walsh’s XML-dialect
CBML and editor software developed for the corpus, annotation
combines a focus on the first ten pages of each title and sample
annotation of full-length books. XML-annotation currently
includes visual objects, as well as word-image and character
relations (panels, characters, balloons, captions, text, interaction
types). In addition, we provide eye-tracking data for
annotated titles. Information about the corpus and sample
visualizations can be found at:
https://groups.uni-paderborn.de/graphic-literature/gncorpus/corpus.php.
Keywords — Corpus, Graphic Novels, Annotation, Eye Tracking
I. INTRODUCTION
Research on comics has undergone sustained growth over
the last two decades in several disciplines and has now become
a highly diverse field of inquiry. Although there are wordless
and abstract comics, the medium’s complex combination of
words and images in telling stories has drawn the most
sustained interest. Recent advances in image analysis and the
explosive growth of the digital humanities (DH) mean that
considerable efforts are underway to advance the
computational analysis of comics. Several corpora are now
available, including the Manga109 data set [1], the Iyyer
corpus [2], and the eBDtheque [3]. These corpora focus on
different formats and national traditions: Japanese manga in the
first case, US-American comic books from the so-called
Golden Age of the 1930s and ’40s in the second case, and a
mixture of these, plus French bandes dessinées, in the third.
They also vary widely in size. While the eBDtheque runs to only
100 pages, the Manga109 corpus contains roughly as many comic
books, with a total of 21,000 pages [3] [1]. In contrast, the Iyyer
corpus “contains ~1.2 million panels drawn from almost 4,000
publicly available comic books” [2].
Despite their considerable differences, all of these corpora
were assembled by computer scientists. Every disciplinary
perspective leads researchers in that field to make certain
choices and to exclude others, and computer science is no
exception. As a consequence,
existing comics corpora may prove extremely valuable for
digital humanists in some cases but also carry distinct
disadvantages for research that brings together computational
and humanistic or social science aims.
Section II describes how comics corpora can be designed to
appeal to researchers in both of these fields. Such a broader
audience may not be necessary, or even advantageous, in all
cases. Yet, as computational research on comics looks to tackle
more complex aspects of the medium – including visual object
recognition, text-image relations, and scene understanding –
the media-specific knowledge accrued over decades of study in
the humanities becomes increasingly valuable to computer
science. Rule-based approaches to tasks such as panel and text
recognition may also benefit from the descriptive scholarship
of humanities research. Sections III and IV introduce the
XML-annotation language and visual editor developed for the
GNC. The latter section also describes the editor’s general
design, automatic features, and computer-aided annotation
tools. Section V then provides details of eye-tracking data
recorded for our corpus, before the final two sections present
brief overviews of potential applications across a number of
disciplines and discuss future work as we look to expand our
corpus and annotations.
II. CORPUS DESIGN
The basis for any valid statement about a cultural format,
including comics, is a balanced and representative selection of
texts that has been collected according to a transparent
sampling regime. This means that the texts included in a stratified
corpus should reflect the number and types of text that exist
in a given format, be it manga, graphic narratives, or
Franco-Belgian comics. To achieve this aim, collection
must be based on a clear definition of what will be included
and what will not. In practice, this amounts to defining what the
researchers involved in the corpus design understand as the
central characteristics of a certain type of text. To use the
example of the GNC: Graphic narratives represent one of the
most popular, and culturally the most prestigious, forms of
comics production in North America, much of Europe, and
Latin America. Authors such as Art Spiegelman, Chris Ware,
Marjane Satrapi, and Alison Bechdel are recognized as major
artists, and their works are regularly taught at schools and
universities, or adapted into successful plays or films. For the
2017 14th IAPR International Conference on Document Analysis and Recognition
2379-2140/17 $31.00 © 2017 IEEE
DOI 10.1109/ICDAR.2017.286
15