Information Extraction from Spanish Radiology Reports using multilingual BERT

Oswaldo Solarte-Pabón 1,2, Orlando Montenegro 2, Alberto Blazquez-Herranz 1, Hadi Saputro 1, Alejandro Rodriguez-González 1 and Ernestina Menasalvas 1

1 Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Spain
2 Universidad del Valle, Cali, Colombia

Abstract
This paper describes our team's participation in Task 1 of the Conference and Labs of the Evaluation Forum (CLEF eHealth 2021). Task 1 targets Named Entity Recognition (NER) in radiology reports written in Spanish. Our approach addresses this challenge as a sequence labeling task and is based on multilingual BERT with a classification layer on top. Three BERT-based models were trained to support the extraction of overlapping entities: the first model predicts the first specific label annotated in the corpus; the second predicts the second label for tokens that have two different annotations; and the third is used for tokens annotated with a third label in the corpus. Our approach obtained a lenient F1 score of 78.47% and an exact F1 score of 73.27%.

Keywords
Information Extraction, Named Entity Recognition (NER), Multilingual BERT, Radiology Reports

1. Introduction

Radiology reports are one of the most important sources of clinical imaging information. They document critical information about the patient's health and the radiologist's interpretation of medical findings [1]. Information extracted from radiology reports can be used to support clinical research, quality improvement, and evidence-based medicine [2]. However, the information in radiology reports is presented as free text, which makes structuring the data particularly challenging [3]. Extracting this information manually is not viable because it is costly and time-consuming [4]. Several approaches have been proposed to extract information from radiology reports [5, 6, 7].
Most of these proposals use machine learning methods, mainly based on the Conditional Random Fields (CRF) algorithm [8], in which entity extraction is treated as a sequence labeling task. Recently, deep learning approaches have been shown to improve performance in processing natural language texts [9, 10, 11]. Previous studies have shown significant progress in extracting information from clinical reports, but most efforts have focused only on the English language.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
oswaldo.solartep@alumnos.upm.es (O. Solarte-Pabón); orlando.montenegro@correounivalle.edu.co (O. Montenegro); alberto.bherranz@upm.es (A. Blazquez-Herranz); H.saputro@alumnos.upm.es (H. Saputro); alejandro.rg@upm.es (A. Rodriguez-González); ernestina.menasalvas@upm.es (E. Menasalvas)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

However, the Task 1 challenge of CLEF 2021 targets Named Entity Recognition (NER) and