Vision and Natural Language for Metadata Extraction from
Scientific PDF Documents: A Multimodal Approach
Zeyd Boukhers and Azeddine Bouabdallah
University of Koblenz-Landau, Germany
{boukhers,bazeddine}@uni-koblenz.de
ABSTRACT
The challenge of automatically extracting metadata from scientific
PDF documents varies depending on the diversity of layouts
within the PDF collection. In some disciplines, such as the German
social sciences, authors are not required to follow a specific
template and often create their own, which yields a high diversity
of appearance across publications.
Overcoming this diversity using only Natural Language Processing
(NLP) approaches is not always effective, which is reflected in the
unavailability of metadata for a large portion of German social science
publications. Therefore, we propose in this paper a multimodal
neural network model that employs NLP together with Computer
Vision (CV) for metadata extraction from scientific PDF documents.
The aim is to benefit from both modalities to increase the overall
accuracy of metadata extraction. Extensive experiments with the
proposed model on around 8800 documents demonstrated its effectiveness
over unimodal models, with an overall F1 score of 92.3%.
CCS CONCEPTS
· Computing methodologies → Neural networks; Supervised
learning; · Applied computing → Document management
and text processing.
KEYWORDS
metadata extraction, multimodal ML, NLP, CV
ACM Reference Format:
Zeyd Boukhers and Azeddine Bouabdallah, University of Koblenz-Landau,
Germany, {boukhers,bazeddine}@uni-koblenz.de. 2022. Vision and
Natural Language for Metadata Extraction from Scientific PDF Documents:
A Multimodal Approach. In The ACM/IEEE Joint Conference on Digital
Libraries in 2022 (JCDL ’22), June 20–24, 2022, Cologne, Germany. ACM, New
York, NY, USA, 5 pages. https://doi.org/10.1145/3529372.3533295
1 INTRODUCTION
With the continuously expanding use of digital libraries, a huge
number of scientific papers are published every year in digital
format. Specifically, nearly two million scientific papers are
published each year [2]. These papers require automatic processing to
ease their use by scholars in tasks such as paper querying, citation counting,
and paper recommendation. Therefore, the availability of metadata
(e.g. title, authors, and year of publication) is important. However,
in some disciplines such as German social science, a significant
number of the published papers are not covered in accessible
bibliographic databases [9, 26]. This means that the metadata can only
be obtained by extracting it from the PDF documents.
Intuitively, the metadata is extracted using NLP approaches [21],
which have demonstrated their efficiency on English documents due
to the relatively standard layout in English corpora [6]. German
scientific papers, by contrast, often come in a large variety of layouts
because they are mainly published by small and mid-size publishers that
do not impose standard templates. Since the German social science
community is relatively small, there is not much work addressing
this problem [10, 12]. To overcome this, we proposed in an earlier
study [6] to tackle the problem using CV techniques by viewing
the PDF as an RGB image. Although this approach demonstrated
promising performance on a challenging dataset, it still fails to
accurately extract some patterns (e.g. DOI), which are supposed to
be easily extracted by NLP-based approaches.
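To illustrate why a pattern such as a DOI is considered easy for text-based processing, the following minimal sketch matches DOI-like strings with a regular expression. The pattern, the function name find_dois, and the sample string are illustrative assumptions; this is not the extractor used in this work, and the pattern does not cover every valid DOI.

```python
import re
from typing import List

# Assumed pattern for the common "10.<registrant>/<suffix>" DOI form.
# It is a simplification and will miss unusual DOIs.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text: str) -> List[str]:
    """Return DOI-like strings found in text, trimming trailing punctuation."""
    return [m.group(0).rstrip(".,;)") for m in DOI_PATTERN.finditer(text)]


# Hypothetical input: a line of text as it might be read from a first page.
sample = "York, NY, USA, 5 pages. https://doi.org/10.1145/3529372.3533295"
print(find_dois(sample))
```

A purely visual model has no direct access to such character-level regularities, which is one motivation for also using the textual content.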
Combining NLP and CV to exploit multiple types of input data has
proven effective in previous works
in different fields [28]. Therefore, this paper tackles the problem of
automatically extracting metadata from scientific documents in the
German language using a multimodal approach that views a PDF
document both as an RGB image and as a textual document. We
assume that jointly learning from both modalities can lead to a
better understanding of the documents.
To this end, we trained a multimodal neural network model with
two sub-models: the first is a BiLSTM model fed with the layout
and context features of the content, and the second takes as
input the image representation of the PDF document. Using late
fusion, the output vectors generated by the two sub-models are
concatenated and used as input to another BiLSTM model, which
classifies each token.
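The late-fusion step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: both sub-models are assumed to emit one output vector per token, the final BiLSTM is replaced by a simple per-token linear scorer as a stand-in, and the vectors, weights, and label names are made-up toy values rather than anything taken from the trained model.

```python
from typing import List

Vector = List[float]


def late_fusion(text_vectors: List[Vector], image_vectors: List[Vector]) -> List[Vector]:
    """Concatenate, token by token, the outputs of the two sub-models."""
    if len(text_vectors) != len(image_vectors):
        raise ValueError("both sub-models must emit one vector per token")
    return [t + v for t, v in zip(text_vectors, image_vectors)]


def classify(fused: List[Vector], weights: List[Vector], labels: List[str]) -> List[str]:
    """Stand-in for the final BiLSTM: a per-token linear scorer with argmax."""
    predictions = []
    for vec in fused:
        scores = [sum(w * x for w, x in zip(row, vec)) for row in weights]
        predictions.append(labels[scores.index(max(scores))])
    return predictions


# Toy outputs for two tokens (hypothetical values).
text_out = [[0.9, 0.1], [0.2, 0.8]]   # text sub-model output
image_out = [[1.0], [0.0]]            # image sub-model output
fused = late_fusion(text_out, image_out)

# One weight row per label; labels are hypothetical examples.
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
predictions = classify(fused, weights, ["title", "author"])
print(fused, predictions)
```

The point of the sketch is the order of operations: each modality is processed separately first, and only the resulting vectors are joined before the per-token decision.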
The remainder of this paper is organized as follows. Section 2
discusses related work.
Section 3 presents the proposed approach, and Section 4 presents
the conducted experiments and the results that validate the
effectiveness of the proposed approach. Finally, Section 5 concludes
the paper and gives insight into future directions.
2 RELATED WORK
In this section, we group the related work on extracting metadata
from PDF documents into three main categories.
2.1 Natural Language Processing
[14] considers two types of metadata extraction methods, namely
machine learning-based [15, 23, 24] and rule-based approaches [16,
19]. Machine learning approaches such as CiteSeerX [20] train