Distributional Semantic Representation in Health Care Text Classification
NLP_CEN_AMRITA@CHIS-FIRE-2016

Barathi Ganesh HB
Artificial Intelligence Practice
Tata Consultancy Services
Kochi - 682 042 India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham
Amrita University, India
m anandkumar@cb.amrita.edu, kp soman@.amrita.edu

ABSTRACT
This paper describes our proposed system for the Consumer Health Information Search (CHIS) task. The objective of task 1 is to classify the sentences in a document as relevant or irrelevant with respect to the query, and the objective of task 2 is to analyse the sentiment of the sentences in the documents with respect to the given query. In the proposed approach, a distributional representation of the text, together with statistical and distance measures computed on it, is used to perform the given tasks as a text classification problem. In our experiment, Non-Negative Matrix Factorization is used to obtain the distributed representation of the documents as well as the queries, distance and correlation measures are taken as the features, and a Random Forest Tree is used to perform the classification. The proposed approach yields an average accuracy of 70.19% in task 1 and 34.64% in task 2.

Keywords
Health Science; Distributional Semantics; Non-Negative Matrix Factorization; Term-Document Matrix; Text Classification

1. INTRODUCTION
Over the past few years, a tremendous amount of investment and research has been carried out to enhance predictive analytics through text analytics in the health care domain [11, 10]. Health care information is available as text (Clinical Trials) in the form of admission notes, literature, reports and summaries¹,²,³. Unlike traditionally structured text resources, the unstructured nature of clinical trial text sources introduces more challenges when mining information out of them.
These challenges induce researchers to carry out text analytics research to enhance the developed models and to create new models. The information is explicitly available in Electronic Health Records (EHR) but only implicitly available in clinical trials in the form of text. Our primary problem therefore becomes representing text in a form that can be easily and effectively used for further application. The application may be a sequential modeling task (Information Extraction) or a text classification task (Document Retrieval, sentiment analysis on retrieved documents and validation of retrieved documents).

¹ https://medlineplus.gov/
² https://clinicaltrials.gov/
³ https://clinicaltrials.gov/

Document retrieval is a primary task in text analytics applications, within which the Consumer Health Information Search (CHIS) task focuses on validating the retrieved results (Relevant or Irrelevant) and performing sentiment analysis on the retrieved results (Support, Oppose and Neutral). The given problem can be viewed as a text classification problem with the target classes mentioned in the above two tasks.

Text classification is a classic application in the text analytics domain that is utilized across multiple domains and industries in various forms. Given a text content, the classifier must be capable of classifying it into a predefined set of classes [1]. This task becomes more complex when the text content includes medical descriptions (drug names, measurements and dosages). Such content introduces problems during the representation as well as while mining information out of it.

The fundamental component in a classification task is text representation, which tries to represent the given text in an equivalent form of numerical components. Later, these numerical components are either utilized directly for the classification or used to extract the features required to perform the classification task.
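As an illustration of what "numerical components" can mean in practice, the following is a minimal sketch that maps each text to a vector of term counts over a shared vocabulary. The toy corpus, the whitespace tokenization and the helper names `build_vocabulary` and `to_frequency_vector` are assumptions made for this example and are not taken from the system described in this paper.

```python
from collections import Counter


def build_vocabulary(texts):
    """Return the sorted list of distinct lowercase tokens across all texts."""
    return sorted({token for text in texts for token in text.lower().split()})


def to_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the given text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical two-sentence corpus, used only for illustration.
corpus = [
    "aspirin reduces fever",
    "aspirin dosage for fever and pain",
]

vocabulary = build_vocabulary(corpus)
vectors = [to_frequency_vector(text, vocabulary) for text in corpus]

for text, vector in zip(corpus, vectors):
    print(text, "->", vector)
```

Each resulting vector has one entry per vocabulary term, so texts of different lengths become comparable fixed-size numerical inputs that a classifier or a feature extraction step could consume.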
These text representation methods evolved over time to improve the faithfulness of the representation, which paved the way for a move from frequency-based representation methods to semantic representation methods. Though other methods are also available, this paper focuses only on the Vector Space Model (VSM) and the Vector Space Model of Semantics (VSMs) [13].

In VSM, the text is represented as a vector based on the occurrence of terms (binary matrix) or the frequency of occurrence of terms (Term-Document Matrix) present in the given text. The vector is computed against a vocabulary built across the entire corpus. Here, 'terms' represents words or phrases [8]. Considering only the term frequency is not sufficient, since it ignores the syntactic and semantic information that lies within the text.

The term-document matrix is inefficient due to the biasing problem (i.e. a few terms get higher weight because of un-