Distributional Semantic Representation in Health Care Text Classification
NLP_CEN_AMRITA@CHIS-FIRE-2016

Barathi Ganesh HB
Artificial Intelligence Practice, Tata Consultancy Services, Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
m anandkumar@cb.amrita.edu, kp soman@.amrita.edu

ABSTRACT
This paper describes our proposed system for the Consumer Health Information Search (CHIS) task. The objective of task 1 is to classify the sentences in a document as relevant or irrelevant with respect to the query, and the objective of task 2 is to analyse the sentiment of the sentences in the documents with respect to the given query. In the proposed approach, a distributional representation of the text, along with its statistical and distance measures, is used to perform the given tasks as a text classification problem. In our experiment, Non-Negative Matrix Factorization is used to obtain the distributed representation of the documents as well as the queries, distance and correlation measures are taken as the features, and a Random Forest Tree is used to perform the classification. The proposed approach yields an average accuracy of 70.19% in task 1 and 34.64% in task 2.

Keywords
Health Science; Distributional Semantics; Non-Negative Matrix Factorization; Term-Document Matrix; Text Classification

1. INTRODUCTION
Over the past few years, a tremendous amount of investment and research has been carried out to enhance predictive analytics through text analytics in the health care domain [11, 10]. Health care information is available as text (Clinical Trials) in the form of admission notes, literature, reports, and summaries (footnotes 1, 2, 3). Unlike traditionally structured text resources, the unstructured nature of clinical trial text sources introduces more challenges while mining information out of them.
These challenges motivate researchers to carry out text analytics research, both to enhance existing models and to create new ones. Information that is explicitly available in Electronic Health Records (EHR) is only implicitly available in clinical trials, in the form of text. Our primary problem therefore becomes representing text in a form that can be easily and effectively used for further application. The application may be a sequential modeling task (Information Extraction) or a text classification task (Document Retrieval, sentiment analysis on retrieved documents, and validation of retrieved documents).

1 https://medlineplus.gov/
2 https://clinicaltrials.gov/
3 https://clinicaltrials.gov/

Document retrieval is a primary task in text analytics applications. The Consumer Health Information Search (CHIS) task focuses on validating the retrieved results (Relevant or Irrelevant) and on performing sentiment analysis on the retrieved results (Support, Oppose, and Neutral). The given problem can be viewed as a text classification problem with the target classes mentioned in these two tasks.

Text classification is a classic application in the text analytics domain that is used across multiple domains and industries in various forms. Given a text, the classifier must be capable of assigning it to one of a predefined set of classes [1]. This task becomes more complex when the text includes medical descriptions (drug names, measurements, and dosages), which introduce problems both during representation and while mining information from the text.

The fundamental component in a classification task is text representation, which tries to represent the given text in an equivalent numerical form. Later, these numerical components are either used directly for classification or used to extract the features required to perform the classification task.
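As an illustration of mapping text to numerical components, the following minimal sketch counts term occurrences over a shared vocabulary. It is not the system described in this paper: the whitespace tokenization, the helper names `build_vocabulary` and `to_count_vector`, and the toy sentences are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect the sorted set of lowercase whitespace tokens seen in any text."""
    return sorted({token for text in texts for token in text.lower().split()})


def to_count_vector(text, vocabulary):
    """Map one text to a list of term counts, one entry per vocabulary term."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that do not occur in this text.
    return [counts[term] for term in vocabulary]


# Hypothetical toy sentences, not taken from the CHIS data.
texts = [
    "vitamin c helps the common cold",
    "vitamin c does not cure the cold",
]

vocabulary = build_vocabulary(texts)
vectors = [to_count_vector(text, vocabulary) for text in texts]

for text, vector in zip(texts, vectors):
    print(vector, "<-", text)
```

Each resulting vector could be passed to a classifier directly or used to derive further features, such as distances between a query vector and a sentence vector.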
These text representation methods have evolved over time to represent the original text more faithfully, which paved the way for the move from frequency-based representation methods to semantic representation methods. Though other methods are also available, this paper focuses only on the Vector Space Model (VSM) and the Vector Space Model of Semantics (VSMs) [13].

In VSM, the text is represented as a vector based on the occurrence of terms (binary matrix) or the frequency of occurrence of terms (Term-Document Matrix) in the given text, using a vocabulary built across the entire corpus. Here, 'terms' represents words or phrases [8]. Considering only the term frequency is not sufficient, since it ignores the syntactic and semantic information that lies within the text.

The term-document matrix is inefficient due to the biasing problem (i.e., a few terms get higher weight because of un-