TF-IDF vs Word Embeddings for Morbidity Identification in Clinical Notes: An Initial Study

Danilo Dessì 1[0000-0003-3843-3285], Rim Helaoui 2[0000-0001-6915-8920], Vivek Kumar 1[0000-0003-3958-4704], Diego Reforgiato Recupero 1[0000-0001-8646-6183], and Daniele Riboni 1[0000-0002-0695-2040]

1 University of Cagliari, Cagliari, Italy
{danilo_dessi, vivek.kumar, diego.reforgiato, riboni}@unica.it
2 Philips Research, Eindhoven, Netherlands
rim.helaoui@philips.com

Abstract. Today, we are seeing an ever-increasing number of clinical notes that contain clinical results, images, and textual descriptions of patients' health state. All these data can be analyzed and used to provide novel services that help people and domain experts with their common healthcare tasks. However, technologies such as Deep Learning and tools like Word Embeddings have only recently started to be investigated, and many challenges remain open when it comes to healthcare domain applications. To address these challenges, we propose the use of Deep Learning and Word Embeddings for identifying sixteen morbidity types within textual descriptions of clinical records. For this purpose, we have used a Deep Learning model based on Bidirectional Long Short-Term Memory (LSTM) layers, which can exploit state-of-the-art vector representations of data such as Word Embeddings. We have employed the pre-trained Word Embeddings GloVe and Word2Vec, as well as our own Word Embeddings trained on the target domain. Furthermore, we have compared the performance of the Deep Learning approaches against traditional tf-idf representations classified with a Support Vector Machine and a Multilayer Perceptron (our baselines). The results suggest that the baselines outperform the Deep Learning approaches with any of the Word Embeddings. Our preliminary results indicate that specific features of the dataset bias it in favour of traditional machine learning approaches.
Keywords: Deep Learning · Natural Language Processing · Morbidity Detection · Word Embeddings · Classification.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, life expectancy has increased, and with it the risk of long-term diseases such as cancer, diabetes, mental health conditions, and other chronic health threats [22, 10, 3, 21]. A further drawback of longer life expectancy is that people can be affected by more than one