Dataset of Pages from Early Printed Books
with Multiple Font Groups
Mathias Seuret∗
Pattern Recognition Lab,
Friedrich-Alexander-Universität
Erlangen-Nürnberg
Erlangen, Germany
mathias.seuret@fau.de
Saskia Limbach∗
Gutenberg-Institut für Weltliteratur und schriftorientierte Medien,
Abteilung Buchwissenschaft
Mainz, Germany
limbach@uni-mainz.de
Nikolaus Weichselbaumer∗
Gutenberg-Institut für Weltliteratur und schriftorientierte Medien,
Abteilung Buchwissenschaft
Mainz, Germany
weichsel@uni-mainz.de
Andreas Maier
Pattern Recognition Lab,
Friedrich-Alexander-Universität
Erlangen-Nürnberg
Erlangen, Germany
andreas.maier@fau.de
Vincent Christlein
Pattern Recognition Lab,
Friedrich-Alexander-Universität
Erlangen-Nürnberg
Erlangen, Germany
vincent.christlein@fau.de
ABSTRACT
Based on contemporary scripts, early printers developed a
large variety of different fonts. While fonts may differ slightly
from one printer to another, they can be divided into font
groups, such as Textura, Antiqua, or Fraktur. The recognition
of font groups is important for computer scientists to select
adequate OCR models, and of high interest to humanities
scholars studying early printed books and the history of fonts.
In this paper, we introduce a new, public dataset for the
recognition of font groups in early printed books, and evaluate
several state-of-the-art CNNs for the font group recognition
task. The dataset consists of more than 35 600 page images,
each page showing up to five different font groups, of which
ten are considered in this dataset.
CCS CONCEPTS
• Information systems → Clustering and classification;
Document filtering; Presentation of retrieval results; • Computing
methodologies → Information extraction; Computer vision;
Neural networks.
KEYWORDS
dataset, neural network, digital humanities, historical documents,
document analysis, incunabula, book history, type,
typography, fonts, Antiqua, Italic, Textura, Rotunda, Gotico-Antiqua,
Bastarda, Schwabacher, Fraktur, Greek, Hebrew
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
HIP’19, September 20–21, 2019, Sydney, Australia
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-9999-9/18/06…$15.00
https://doi.org/10.1145/3352631.3352640
ACM Reference Format:
Mathias Seuret, Saskia Limbach, Nikolaus Weichselbaumer,
Andreas Maier, and Vincent Christlein. 2019. Dataset of Pages from
Early Printed Books with Multiple Font Groups. In HIP’19: 5th
International Workshop on Historical Document Imaging and
Processing, Sydney, Australia. ACM, New York, NY, USA, 6 pages.
https://doi.org/10.1145/3352631.3352640
1 INTRODUCTION
The dataset presented in this paper¹ is meant to aid the
automatic recognition of font groups in scans of early modern²
books (see Sec. 3). Its primary purpose is to support training and
applying font-specific OCR models, in order to improve the quality of
OCR for historical documents, which currently suffers greatly
from the almost unmanageable variance of early modern
fonts.³ Training font group detection methods requires a
lot of data, and the dataset presented in this paper is, to
the authors’ knowledge, the very first for early modern font
groups.
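As an illustration of what training data for this task might look like, the following minimal Python sketch encodes the font groups visible on one page as a multi-hot target vector. The ten class names are taken from the keyword list of this paper, and the limit of five groups per page from the abstract. The class order, the constant names, and the function `encode_page_labels` are assumptions made for this example and do not describe the label format actually distributed with the dataset.

```python
from typing import Iterable, List

# The ten font groups named in the paper's keywords.
# The ordering is an arbitrary choice made for this sketch.
FONT_GROUPS = [
    "Antiqua", "Italic", "Textura", "Rotunda", "Gotico-Antiqua",
    "Bastarda", "Schwabacher", "Fraktur", "Greek", "Hebrew",
]

# The abstract states that a page shows up to five different font groups.
MAX_GROUPS_PER_PAGE = 5


def encode_page_labels(groups: Iterable[str]) -> List[int]:
    """Return a multi-hot vector with a 1 for every font group on the page."""
    present = set(groups)
    unknown = present - set(FONT_GROUPS)
    if unknown:
        raise ValueError(f"unknown font groups: {sorted(unknown)}")
    if len(present) > MAX_GROUPS_PER_PAGE:
        raise ValueError("a page is expected to show at most five font groups")
    return [1 if name in present else 0 for name in FONT_GROUPS]


if __name__ == "__main__":
    # Hypothetical page mixing a Fraktur body text with Antiqua passages.
    print(encode_page_labels(["Fraktur", "Antiqua"]))
```

A multi-hot target of this kind is a common choice when one image can carry several labels at once; whether it matches the evaluation setup used later in the paper is not implied here.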
Automatic font group recognition also helps historians by
enabling them to analyse how font groups were used over
time and how other factors like genre, language and printing
place played a role in this development. This will help to
answer long-standing research questions about phenomena
like the rapid establishment of Fraktur as the font for German
texts from the 16th century onwards.
2 EXISTING DATASETS
For modern fonts, some datasets are available, such
as the Adobe VFR Dataset [19], and it is also possible to
generate images on the fly. However, for early modern printed
documents, there is, to the best of our knowledge, no font group
dataset comparable to ours. The most closely related is the one used
∗ Equal contribution
¹ https://doi.org/10.5281/zenodo.3366685
² From the middle of the 15th century to the end of the 18th century.
³ For more information about this approach see http://ocr-d.de.