Abstract— Tuberculosis is an infectious disease that is spread
through the air from one person to another and is one of the
top ten causes of death in the world according to the World
Health Organization. In biomedical engineering, decision
support systems based on artificial intelligence have shown
advantages for healthcare personnel in tasks such as diagnosis
and screening. Natural language processing is a specific area of
artificial intelligence; however, most of its approaches depend
on the availability of data. This paper presents the
construction of a dataset based on the medical records of subjects
suspected of having tuberculosis. In addition, it presents an
initial exploration of the contents of the constructed dataset
and describes how this approach can be followed by natural
language processing to support tuberculosis diagnosis in
data-demanding scenarios.
Clinical Relevance— In some developing countries, such as
Colombia, it is difficult to develop systems based on artificial
intelligence due to the limited availability of data. This proposal
presents a strategy to build a dataset for training machine
learning models and obtaining diagnosis support tools, using the
natural language written by health professionals in the medical
record. In this way, models trained on this available
information can be employed in places where medical
infrastructure is precarious.
I. INTRODUCTION
Tuberculosis (TB) is an infectious disease caused by
Mycobacterium tuberculosis. This bacterium mostly attacks
the lungs, but it can also affect other parts of the body, and it
spreads easily between people through the air,
especially in areas with high population density and
low socioeconomic conditions [1]. Additionally, TB has been
recognized as a global emergency by the World Health
Organization (WHO) due to its worldwide impact. It is
among the top 10 causes of death worldwide and is a leading
cause of death from a single infectious agent. Ending the TB
epidemic by 2030 is therefore one of the health-related targets
of the Sustainable Development Goals (SDGs) [2].
For developing countries, the situation is especially
difficult. Detection of TB is challenging due to the limitations
of the medical infrastructure [1] [3]. Specialized laboratories
are required to determine whether people suspected of TB have
the disease. In addition, the availability of health
professionals to cover regions far from big cities is a problem
in places with only basic public health infrastructure.
Colombia, as a Latin American country, has a TB incidence that
has fluctuated over the last 10 years with a tendency to increase.
In 2019, the incidence was 26.9 per 100 thousand
inhabitants. This increase can be explained by the
strengthening of the disease surveillance and monitoring
actions carried out in the country [4].
Andrés Romero is with the Biomedical Engineering Program, Escuela
Colombiana de Ingeniería Julio Garavito – Universidad del Rosario, Bogotá
D.C., Colombia (e-mail: andres.romero-go@mail.escuelaing.edu.co).
Alvaro D. Orjuela-Cañón is with the School of Medicine and Health
Sciences, Universidad del Rosario, Bogotá D.C., Colombia (e-mail:
alvaro.orjuela@uroasrio.edu.co).
Andrés L. Jutinico and Erika Vergara are with the Mechanical,
Electronics and Biomedical Faculty, Universidad Antonio Nariño, Bogotá
D.C., Colombia (e-mail: ajutinico@uan.edu.co, paoli1982@gmail.com).
Carlos Awad and Angélica Palencia are with the Subred Integrada de
Servicios de Salud Centro Oriente, Bogotá D.C., Colombia (e-mail:
carlosawad@gmail.com, angelicapalenciab@gmail.com).
In biomedical engineering, new proposals have expanded the
strategies available in the health area for solving
traditional problems. In particular, applications based on
artificial intelligence (AI) techniques are being developed,
most notably the so-called decision support systems (DSS).
These techniques have been shown to be useful as support
tools in the diagnosis and prognosis of diseases,
providing extra help to health professionals and contributing
new insights for addressing these problems [5] [6] [7].
For the specific case of TB diagnosis support, AI has
been employed in different scenarios: artificial
neural networks in demanding scenarios [8], proposals based
on images [9], and other applications [10], [11]. Furthermore,
recent applications of AI are related to natural
language processing (NLP), a computational
approach for analyzing text written in an
unstructured form, as in the case of medical records (MRs).
In the clinical setting, NLP is usually performed using
techniques based on rules given by an expert or a system, but
techniques based on machine learning
(ML) have been shown to increase performance [12]. In medicine, NLP
has been used in tasks such as extracting relevant information
from gastroenterological reports [13], determining the
eligibility of patients for intravenous thrombolytic therapy
[14], managing patients with heart failure from MRs [15], and
supporting the diagnosis of respiratory diseases from chest
X-rays using radiologists' reports [16].
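The contrast between rule-based and ML-based processing of unstructured text can be illustrated with a minimal sketch. It is not the method of any of the cited works: the term list, the function names, and the sample note are assumptions made only for illustration. The rule-based view looks for terms supplied by an expert, while the ML-oriented view converts the note into word counts that a classifier could later take as input.

```python
import re
from collections import Counter

# Hypothetical expert rules: terms a clinician might associate with
# suspected TB. These terms are illustrative assumptions only.
RULE_TERMS = ("cough", "fever", "night sweats", "weight loss")


def tokenize(text):
    """Lowercase the note and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rule_based_hits(text):
    """Rule-based view: return the expert-listed terms found in the note."""
    lowered = text.lower()
    return [term for term in RULE_TERMS if term in lowered]


def bag_of_words(text):
    """ML-oriented view: turn the note into word counts (features)."""
    return Counter(tokenize(text))


# Made-up note, not taken from any real medical record.
note = "Patient reports persistent cough and weight loss. Cough for three weeks."
print(rule_based_hits(note))
print(bag_of_words(note)["cough"])
```

In this sketch the rules only fire on terms an expert listed in advance, whereas the word counts keep every token and leave it to a trained model to decide which ones matter.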
However, these approaches depend on
data, which is a problem if the data are unstructured or
unavailable. This paper presents the development of a database
of clinical reports of suspected TB patients extracted from
their MRs. The extracted texts contain information about the
patient's health status at times prior to the diagnosis of
tuberculosis, with the aim of providing this information to an
NLP system that supports the diagnosis of active TB. In
addition, some relevant aspects of the specific medical
language found in this study are discussed.
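As a rough illustration of keeping only texts written before the diagnosis, the following sketch filters notes by date. The record schema, field names, and sample entries are hypothetical and are not taken from the dataset described in this paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ClinicalNote:
    """One free-text entry from a medical record (hypothetical schema)."""
    patient_id: str
    written_on: date
    text: str


def notes_before_diagnosis(notes, diagnosis_dates):
    """Keep only notes written strictly before each patient's diagnosis date."""
    kept = []
    for note in notes:
        diagnosed_on = diagnosis_dates.get(note.patient_id)
        # Skip patients without a recorded diagnosis date.
        if diagnosed_on is not None and note.written_on < diagnosed_on:
            kept.append(note)
    return kept


# Made-up entries for illustration only.
notes = [
    ClinicalNote("p01", date(2020, 3, 1), "Cough for three weeks."),
    ClinicalNote("p01", date(2020, 3, 20), "Treatment started."),
]
diagnosis_dates = {"p01": date(2020, 3, 10)}
selected = notes_before_diagnosis(notes, diagnosis_dates)
print([n.text for n in selected])
```

Filtering on the date keeps text written after the diagnosis out of the input, so a model trained on the result cannot simply read the outcome it is meant to support.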
Preliminary Text Analysis from Medical Records for TB Diagnosis Support
Andrés Felipe Romero Gómez, Alvaro D. Orjuela-Cañón, Member, SMIEEE, Andrés L. Jutinico,
Carlos Awad, Erika Vergara, and Angélica Palencia
2021 43rd Annual International Conference of the
IEEE Engineering in Medicine & Biology Society (EMBC)
Oct 31 - Nov 4, 2021. Virtual Conference
978-1-7281-1178-0/21/$31.00 ©2021 IEEE