Vision and Natural Language for Metadata Extraction from Scientific PDF Documents: A Multimodal Approach

Zeyd Boukhers and Azeddine Bouabdallah
University of Koblenz-Landau, Germany
{boukhers,bazeddine}@uni-koblenz.de

ABSTRACT

The challenge of automatically extracting metadata from scientific PDF documents varies depending on the diversity of layouts within the PDF collection. In some disciplines such as German social sciences, authors are not required to prepare their papers according to a specific template, and they often create their own templates, which yields a high diversity of appearance across publications. Overcoming this diversity using only Natural Language Processing (NLP) approaches is not always effective, which is reflected in the unavailability of metadata for a large portion of German social science publications. Therefore, we propose in this paper a multimodal neural network model that employs NLP together with Computer Vision (CV) for metadata extraction from scientific PDF documents. The aim is to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments with the proposed model on around 8800 documents demonstrated its effectiveness over unimodal models, with an overall F1 score of 92.3%.

CCS CONCEPTS

• Computing methodologies → Neural networks; Supervised learning; • Applied computing → Document management and text processing.

KEYWORDS

metadata extraction, multimodal ML, NLP, CV

ACM Reference Format:
Zeyd Boukhers and Azeddine Bouabdallah, University of Koblenz-Landau, Germany, {boukhers,bazeddine}@uni-koblenz.de. 2022. Vision and Natural Language for Metadata Extraction from Scientific PDF Documents: A Multimodal Approach. In The ACM/IEEE Joint Conference on Digital Libraries in 2022 (JCDL '22), June 20–24, 2022, Cologne, Germany. ACM, New York, NY, USA, 5 pages.
https://doi.org/10.1145/3529372.3533295

© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-9345-4/22/06...$15.00

1 INTRODUCTION

With the continuously expanding use of digital libraries, a huge number of scientific papers are published every year in digital format. Specifically, nearly two million scientific papers are published each year [2]. These papers require automatic processing to ease their use by scholars in tasks such as querying papers, counting citations, and recommending papers. Therefore, the availability of metadata (i.e. title, authors, year of publication, etc.) is important. However, in some disciplines such as German social science, a considerable number of the published papers are not covered in accessible bibliographic databases [9, 26]. This means that the metadata can only be obtained by extracting it from the PDF documents. Intuitively, the metadata is extracted using NLP approaches [21], which have demonstrated their efficiency on English documents due to the relatively standard layout in English corpora [6]. German scientific papers often come in a large variety of layouts because they are mainly published by small and mid-size publishers who do not require authors to use standard templates.
Since the German social science community is relatively small, there is not much work addressing this problem [10, 12]. To address this, we proposed in an earlier study [6] to tackle the problem using CV techniques by viewing the PDF as an RGB image. Although this approach demonstrated promising performance on a challenging dataset, it still fails to accurately extract some patterns (e.g. DOI) that are supposed to be easily extracted by NLP-based approaches.

Introducing and considering multiple types of input data by combining NLP and CV has proven its effectiveness in previous works in different fields [28]. Therefore, this paper tackles the problem of automatically extracting metadata from scientific documents in the German language using a multimodal approach that views a PDF document both as an RGB image and as a textual document. Our assumption is that jointly learning from both modalities can lead to a better understanding of the documents. To this end, we trained a multimodal neural network model with two sub-models: the first is a BiLSTM model fed with the layout and context features of the content, and the second takes as input the image representation of the PDF document. Using late fusion, the output vectors generated by the two sub-models are concatenated and used as input to another BiLSTM model, which classifies each token.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the proposed approach, and Section 4 presents the conducted experiments and the obtained results, which validate the effectiveness of the proposed approach. Finally, Section 5 concludes the paper and gives insight into future directions.

2 RELATED WORK

In this section, we group the related works on extracting metadata from PDF documents into three main categories.

2.1 Natural Language Processing

The work in [14] considers two types of metadata extraction methods, namely machine learning-based approaches [15, 23, 24] and rule-based approaches [16, 19].
Machine learning approaches such as CiteSeerX [20] train