Information Extraction from Spanish Radiology
Reports using multilingual BERT
Oswaldo Solarte-Pabón¹,², Orlando Montenegro², Alberto Blazquez-Herranz¹, Hadi Saputro¹, Alejandro Rodriguez-González¹ and Ernestina Menasalvas¹
¹ Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Spain
² Universidad del Valle, Cali, Colombia
Abstract
This paper describes our team’s participation in Task 1 of CLEF eHealth 2021 (Conference and Labs of the
Evaluation Forum). Task 1 targets Named Entity Recognition (NER) in radiology reports written in
Spanish. We treat this challenge as a sequence labeling task, using multilingual BERT with a
classification layer on top. Three BERT-based models were trained to support the extraction of
overlapping entities: the first model predicts the first specific label annotated in the corpus; the
second predicts the second label for tokens that have two different annotations; and the third predicts
the third label for tokens annotated with three labels. Our approach obtained a lenient F1 score of
78.47% and an exact F1 score of 73.27%.
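The three-model scheme for overlapping entities can be illustrated with a minimal sketch. It assumes each token carries an ordered list of up to three labels and that positions without a label at a given layer are filled with a placeholder tag `O`; the function names and the label names `A`, `B`, `C` are hypothetical and are not taken from the corpus or from our implementation.

```python
from typing import List, Sequence

OUTSIDE = "O"  # assumed filler tag meaning "no label at this layer"


def split_into_layers(
    token_labels: Sequence[Sequence[str]], n_layers: int = 3
) -> List[List[str]]:
    """Split per-token label lists into n_layers parallel label sequences.

    Layer 0 holds each token's first label, layer 1 its second label,
    and so on. Each layer could then serve as the training target of
    one sequence labeling model.
    """
    layers = [[OUTSIDE] * len(token_labels) for _ in range(n_layers)]
    for position, labels in enumerate(token_labels):
        if len(labels) > n_layers:
            raise ValueError(
                f"token {position} has {len(labels)} labels, "
                f"but only {n_layers} layers are available"
            )
        for depth, label in enumerate(labels):
            layers[depth][position] = label
    return layers


def merge_layers(layers: Sequence[Sequence[str]]) -> List[List[str]]:
    """Recombine per-layer predictions into one label list per token."""
    return [
        [label for label in column if label != OUTSIDE]
        for column in zip(*layers)
    ]


# Hypothetical four-token example: the second token has two labels,
# the third has none, and the fourth has three.
example = [["A"], ["A", "B"], [], ["A", "B", "C"]]
layers = split_into_layers(example)
recombined = merge_layers(layers)
```

In this sketch, splitting and merging are inverse operations, so the predictions of the three models can be recombined into a set of (possibly overlapping) labels for every token.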
Keywords
Information Extraction, Named Entity Recognition (NER), Multilingual BERT, Radiology Reports
1. Introduction
Radiology reports are one of the most important sources of clinical imaging information. They
document critical information about the patient’s health and the radiologist’s interpretation of
medical findings [1]. Information extracted from radiology reports can be used to support clinical
research, quality improvement, and evidence-based medicine [2]. However, the information
in radiology reports is presented in free text format, which makes the task of structuring the
data particularly challenging [3]. Extracting this information manually is not viable because it
is costly and time-consuming [4].
Several studies have proposed methods to extract information from radiology reports [5, 6, 7].
Most of these proposals use machine learning methods, mainly based on the Conditional Random
Fields (CRF) algorithm [8], in which entity extraction is treated as a sequence labeling task.
Recently, deep learning approaches have been shown to improve performance on natural language
processing tasks [9, 10, 11]. Previous studies have shown significant progress in extracting
information from clinical reports, but most efforts have focused only on the English language.
However, the Task 1 challenge of CLEF 2021 targets Named Entity Recognition (NER) and
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
oswaldo.solartep@alumnos.upm.es (O. Solarte-Pabón); orlando.montenegro@correounivalle.edu.co
(O. Montenegro); alberto.bherranz@upm.es (A. Blazquez-Herranz); H.saputro@alumnos.upm.es (H. Saputro);
alejandro.rg@upm.es (A. Rodriguez-González); ernestina.menasalvas@upm.es (E. Menasalvas)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073