Dataset of Pages from Early Printed Books with Multiple Font Groups

Mathias Seuret
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
Erlangen, Germany
mathias.seuret@fau.de

Saskia Limbach
Gutenberg-Institut für Weltliteratur und schriftorientierte Medien, Abteilung Buchwissenschaft
Mainz, Germany
limbach@uni-mainz.de

Nikolaus Weichselbaumer
Gutenberg-Institut für Weltliteratur und schriftorientierte Medien, Abteilung Buchwissenschaft
Mainz, Germany
weichsel@uni-mainz.de

Andreas Maier
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
Erlangen, Germany
andreas.maier@fau.de

Vincent Christlein
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
Erlangen, Germany
vincent.christlein@fau.de

ABSTRACT
Based on contemporary scripts, early printers developed a large variety of different fonts. While fonts may slightly differ from one printer to another, they can be divided into font groups, such as Textura, Antiqua, or Fraktur. The recognition of font groups is important for computer scientists to select adequate OCR models, and of high interest to humanities scholars studying early printed books and the history of fonts. In this paper, we introduce a new, public dataset for the recognition of font groups in early printed books, and evaluate several state-of-the-art CNNs for the font group recognition task. The dataset consists of more than 35 600 page images, each page showing up to five different font groups, of which ten are considered in this dataset.

CCS CONCEPTS
• Information systems → Clustering and classification; Document filtering; Presentation of retrieval results; • Computing methodologies → Information extraction; Computer vision; Neural networks.
KEYWORDS
dataset, neural network, digital humanities, historical documents, document analysis, incunabula, book history, type, typography, fonts, Antiqua, Italic, Textura, Rotunda, Gotico-Antiqua, Bastarda, Schwabacher, Fraktur, Greek, Hebrew

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
HIP’19, September 20–21, 2019, Sydney, Australia
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-9999-9/18/06…$15.00
https://doi.org/10.1145/3352631.3352640

ACM Reference Format:
Mathias Seuret, Saskia Limbach, Nikolaus Weichselbaumer, Andreas Maier, and Vincent Christlein. 2019. Dataset of Pages from Early Printed Books with Multiple Font Groups. In HIP’19: 5th International Workshop on Historical Document Imaging and Processing, Sydney, Australia. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3352631.3352640

1 INTRODUCTION
The dataset presented in this paper¹ is meant to aid the automatic recognition of font groups in scans of early modern² books (see Sec. 3). The primary purpose is to train and apply font-specific OCR models in order to improve the quality of OCR for historical documents, which currently suffers greatly from the almost unmanageable variance of early modern fonts.³ Training font group detection methods requires a lot of data, and the dataset presented in this paper is, to the authors’ knowledge, the very first for early modern font groups.
Automatic font group recognition also helps historians by enabling them to analyse how font groups were used over time and how other factors such as genre, language, and printing place played a role in this development. This will help to answer long-standing research questions about phenomena like the rapid establishment of Fraktur as the font for German texts from the 16th century onwards.

Equal contribution
1 https://doi.org/10.5281/zenodo.3366685
2 From the middle of the 15th century to the end of the 18th century.
3 For more information about this approach see http://ocr-d.de.

2 EXISTING DATASETS
For modern fonts, some datasets are available, such as the Adobe VFR Dataset [19], and it is also possible to generate images on the fly. However, for early modern printed documents, there is, to the best of our knowledge, no font group dataset comparable to ours. The most closely related is the one used