Tumor Morphology Mentions Identification Using
Deep Learning and Conditional Random Fields
Utpal Kumar Sikdar^a, Björn Gambäck^b and M Krishna Kumar^c
^a IBS Software Pvt. Ltd., Trivandrum, Techopark main gate, India-695581
^b Department of Computer Science, Norwegian University of Science and Technology, 7491 Trondheim, Norway
^c IBS Software Pvt. Ltd., Trivandrum, Techopark main gate, India-695581
Abstract
The paper reports the application of several machine learning methods to the task of automatically
finding tumor morphology mentions in Spanish clinical texts. Three setups based on Conditional Random
Fields (CRF) techniques with different feature combinations were tested as well as a deep learning model
(Bi-directional-LSTM-CNN). The best performance was achieved by combining two of the CRF-based
learners and the neural network using a majority voting ensemble.
Keywords
named entity recognition, CRF, Bi-LSTM, CNN, GloVe
1. Introduction
To understand diseases, we need to extract certain key entities, such as symptoms, duration,
and patient age and weight, from unstructured textual medical data. This task, clinical text
mining, is important to enable better clinical decision-making. It is, for example, very helpful
to be able to extract key entities relating to a pandemic (such as disease names like COVID-19
or SARS, and locations) and to take appropriate action based on the disease symptoms and their
attributes. Natural Language Processing fills an important role in extracting such key entities
from different types of textual sources in various languages.
A myriad of medical texts are generated each day in various languages. In Spanish alone,
almost a thousand electronic patient records are generated every minute. Automatically
processing clinical texts in Spanish is thus a challenging task, but one with large potential for
the medical user community as well as for the pharmaceutical industry and the patients.
Similar to Named Entity Recognition, tumor mention identification is a sequence labelling task.
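To make the notion of sequence labelling concrete, the following minimal sketch shows how a marked mention can be turned into one label per token. It assumes a BIO tagging scheme, a hypothetical label name `TUMOR`, and an invented example sentence; none of these are taken from the paper or the shared task data.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str],
                 spans: List[Tuple[int, int]],
                 label: str = "TUMOR") -> List[str]:
    """Convert token-index spans (start inclusive, end exclusive) to BIO tags.

    The label name "TUMOR" is a placeholder chosen for this illustration.
    """
    tags = ["O"] * len(tokens)          # every token starts outside any mention
    for start, end in spans:
        tags[start] = f"B-{label}"      # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the mention
    return tags


# Invented example sentence; tokens 2..4 form the mention.
tokens = ["Paciente", "con", "carcinoma", "ductal", "infiltrante", "de", "mama", "."]
tags = spans_to_bio(tokens, [(2, 5)])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeller, whether CRF-based or neural, is then trained to predict such a tag for each token given its context.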
Following results published by several researchers in 2016 [1, 2, 3], state-of-the-art work on such
sequence labelling tasks has focused on deep learning setups using a neural network structure,
in particular Long Short-Term Memory Recurrent Neural Networks [LSTM; 4], followed by
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: utpal.sikdar@gmail.com (U.K. Sikdar); gamback@ntnu.no (B. Gambäck); krishna.kumar@ibsplc.com (M.K. Kumar)
url: https://www.linkedin.com/in/dr-utpal-kumar-sikdar-31a1779b/ (U.K. Sikdar); https://www.ntnu.edu/employees/gamback (B. Gambäck); https://www.linkedin.com/in/m-krishna-kumar-56383220/ (M.K. Kumar)
orcid: 0000-0002-5252-707X (B. Gambäck)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073