Abstract: Tuberculosis is an infectious disease that spreads through the air from one person to another and is one of the top ten causes of death in the world according to the World Health Organization. In biomedical engineering, decision support systems based on artificial intelligence have shown advantages for healthcare personnel in tasks such as diagnosis and screening. Natural language processing is a specific area of artificial intelligence; however, most of these approaches depend on available data. This paper presents the construction of a dataset based on medical records of subjects suspected of having tuberculosis. In addition, it presents an initial exploration of the contents of the constructed dataset and describes how this approach can be followed by natural language processing to support tuberculosis diagnosis in data-demanding scenarios.

Clinical Relevance: In some developing countries, such as Colombia, it is difficult to develop systems based on artificial intelligence because of limited data availability. This proposal offers a strategy to build a dataset for training machine learning models and obtaining diagnosis support tools, using natural language written by health professionals in the medical record. In this way, models trained on this information can be employed in places where medical infrastructure is precarious.

I. INTRODUCTION

Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis. This bacterium mostly attacks the lungs but can also affect other parts of the body, and it spreads easily between people through the air, especially in areas with high population density and low socioeconomic conditions [1].
Additionally, TB has been recognized as a global emergency by the World Health Organization (WHO) due to its worldwide impact, and it is among the top 10 leading causes of death by a single infectious agent. Thus, ending the TB pandemic by 2030 is one of the health-related targets of the Sustainable Development Goals (SDGs) [2]. For developing countries, the situation is especially difficult. Detection of TB is challenging due to the limitations of the medical infrastructure [1], [3]. Specialized laboratories are required to determine whether people suspected of having TB have the disease. In addition, providing health professionals to cover regions far from big cities is a problem in places with only basic public health infrastructure. Colombia, as a Latin American country, has a TB incidence that has fluctuated over the last 10 years with a certain tendency to increase. In 2019, it had an incidence of 26.9 per 100 thousand inhabitants. This increase can be explained by the strengthening of surveillance and monitoring actions for the disease that have been carried out in the country [4].

Andrés Romero is with the Biomedical Engineering Program, Escuela Colombiana de Ingeniería Julio Garavito Universidad del Rosario, Bogotá D.C., Colombia (e-mail: andres.romero-go@mail.escuelaing.edu.co).
Alvaro D. Orjuela-Cañón is with the School of Medicine and Health Sciences, Universidad del Rosario, Bogotá D.C., Colombia (e-mail: alvaro.orjuela@uroasrio.edu.co).
Andrés L. Jutinico and Erika Vergara are with the Mechanical, Electronics and Biomedical Faculty, Universidad Antonio Nariño, Bogotá D.C., Colombia (e-mail: ajutinico@uan.edu.co, paoli1982@gmail.com).
Carlos Awad and Angélica Palencia are with the Subred Integrada de Servicios de Salud Centro Oriente, Bogotá D.C., Colombia (e-mail: carlosawad@gmail.com, angelicapalenciab@gmail.com).
In biomedical engineering, new proposals have expanded the strategies available in the health area to solve traditional problems. In this context, applications based on artificial intelligence (AI) techniques are being developed, notably the so-called decision support systems (DSS). These techniques have been shown to be useful as support tools in the diagnosis and prognosis of diseases, providing extra help to health professionals and contributing new insights for addressing these problems [5], [6], [7]. For the specific case of TB diagnosis support, AI has been employed in different scenarios, including artificial neural networks in demanding scenarios [8], proposals based on images [9], and other applications [10], [11].

Furthermore, recent applications of AI are related to natural language processing (NLP), a computational approach that allows the analysis of text written in an unstructured form, as in the case of medical records (MRs). In the clinical setting, NLP is usually performed using techniques based on rules given by an expert or a system, but techniques based on machine learning (ML) have been seen to increase performance [12]. In medicine, NLP has been used in tasks such as extracting relevant information from gastroenterological reports [13], determining the eligibility of patients for intravenous thrombolytic therapy [14], managing patients with heart failure from MRs [15], and supporting the diagnosis of respiratory diseases from chest X-rays using radiologists' reports [16]. However, these approaches depend on data, which is a problem when the data are unstructured and unavailable. This paper presents the development of a database with clinical reports of suspected TB patients extracted from their MRs.
The extracted texts contain information about the patient's health status at times prior to the diagnosis of tuberculosis, with the aim of providing this information to an NLP system that supports the diagnosis of active TB. In addition, some relevant aspects of the specific medical language found in this study are provided.

Preliminary Text Analysis from Medical Records for TB Diagnosis Support
Andrés Felipe Romero Gómez, Alvaro D. Orjuela-Cañón, Member, SMIEEE, Andrés L. Jutinico, Carlos Awad, Erika Vergara, and Angélica Palencia
2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Oct 31 - Nov 4, 2021, Virtual Conference
978-1-7281-1178-0/21/$31.00 ©2021 IEEE
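As a rough illustration of the rule-based style of clinical NLP mentioned in the Introduction, the following Python sketch flags symptom mentions in free-text notes using hand-written patterns. It is not the method used in this work: the lexicon `SYMPTOM_PATTERNS`, the helper names `find_symptoms` and `symptom_frequencies`, and the example notes are all hypothetical, and the notes are written in English only for readability.

```python
import re
from collections import Counter

# Hypothetical symptom lexicon: these terms are illustrative assumptions,
# not the vocabulary used in the paper.
SYMPTOM_PATTERNS = {
    "cough": re.compile(r"\bcough(?:ing)?\b", re.IGNORECASE),
    "fever": re.compile(r"\bfever(?:ish)?\b", re.IGNORECASE),
    "weight_loss": re.compile(r"\bweight\s+loss\b", re.IGNORECASE),
    "night_sweats": re.compile(r"\bnight\s+sweats?\b", re.IGNORECASE),
}


def find_symptoms(note: str) -> set[str]:
    """Return the lexicon entries mentioned at least once in a free-text note."""
    return {
        name
        for name, pattern in SYMPTOM_PATTERNS.items()
        if pattern.search(note)
    }


def symptom_frequencies(notes: list[str]) -> Counter:
    """Count in how many notes each symptom is mentioned."""
    counts = Counter()
    for note in notes:
        counts.update(find_symptoms(note))
    return counts


# Invented example notes (not real patient data).
notes = [
    "Patient reports persistent cough for three weeks and night sweats.",
    "Fever and weight loss over the last month; coughing at night.",
    "No respiratory complaints. Follow-up visit.",
]

frequencies = symptom_frequencies(notes)
```

With these invented notes, `frequencies` records cough in two notes and each of the other symptoms in one. Simple rules of this kind do not handle negation (for example, "denies cough") or spelling variants, which is one limitation of expert-written rules on unstructured text.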