A Semantic Feature Space for Disease Prediction

Mariam Daoud, Jimmy Xiangji Huang, William Melek, C. Joseph Kurian
Information Retrieval and Knowledge Management Research Lab, School of Information Technology, York University, Toronto, Canada. Email: {daoud,jhuang}@yorku.ca
Alpha Global IT, Toronto, Canada. Email: {william,cjk}@alpha-it.com

EXTENDED ABSTRACT

The huge amount of data generated by modern medicine has motivated us to develop decision support systems for improving health care applications. In this paper, we address the problem of clinical disease prediction given patient-reported symptoms and medical signs, where patient records lack semantic code annotation. We propose a novel context-enhanced disease prediction approach that leverages semantic and contextual relations between medical entities. We have previously exploited semantic relations of medical terminology for patient record search [2], but they have not previously been considered for disease prediction in the literature. Patient signs and symptoms are first mapped to SNOMED-CT concepts, which compose a feature space for disease prediction. Our major contribution in this paper is expanding this feature space using semantic and contextual concept relations of SNOMED-CT.

Based on the patient's reported signs and symptoms, we use a biomedical text mining tool, MetaMap [1], to extract SNOMED-CT concepts. A “concept” in SNOMED-CT is a clinical meaning identified by a unique numeric identifier (ConceptId) and described via a set of words. For each concept, we define a medical entity context by integrating “defining” and “qualitative” medical aspects through the use of different types of semantic and contextual relationships of SNOMED-CT. Figure 1 illustrates the concept “Pneumonia” and its relations to other concepts.

Fig. 1. Illustration of relations in SNOMED-CT

A case study is conducted on a real medical dataset provided by Alpha Global IT, a healthcare company located in Canada.
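The feature-space expansion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one-hop expansion and binary features, and the relation table, the concept names, and the helpers `expand` and `build_feature_space` are hypothetical placeholders rather than real SNOMED-CT content or MetaMap output.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

# Toy relation table: (concept, relation type) -> related concepts.
# Every identifier here is a hypothetical placeholder, not real SNOMED-CT data.
RELATIONS: Dict[Tuple[str, str], Set[str]] = {
    ("breathless", "interprets"): {"respiration"},
    ("breathless", "synonyms"): {"dyspnea"},
    ("palpitation", "interprets"): {"heart_rhythm"},
}


def expand(concepts: Iterable[str], relation_types: Iterable[str]) -> Set[str]:
    """Return the input concepts plus their one-hop neighbours
    along the selected relation types."""
    base = set(concepts)
    rels = list(relation_types)
    expanded = set(base)
    for concept in base:
        for rel in rels:
            expanded |= RELATIONS.get((concept, rel), set())
    return expanded


def build_feature_space(
    records: Sequence[Iterable[str]], relation_types: Sequence[str]
) -> Tuple[List[str], List[List[int]]]:
    """Expand every record, then encode each one as a binary vector
    over the sorted vocabulary of all expanded concepts."""
    expanded = [expand(record, relation_types) for record in records]
    vocabulary = sorted(set().union(*expanded))
    vectors = [[1 if c in e else 0 for c in vocabulary] for e in expanded]
    return vocabulary, vectors
```

Calling `build_feature_space([["breathless"], ["breathless", "palpitation"]], ["interprets"])` expands both records along a single relation type. Passing a different list of relation types changes which features are added, which is the variable the experiments below compare.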
Patient records are pre-annotated with diseases. We evaluate the impact of our proposed feature space on disease prediction performance. Figure 2 presents the classification accuracy of the support vector machine (SVM) classifier (SMO) using different types of medical relations on the cardiology patient records dataset. We chose SVM for studying the impact of relation types on disease prediction because it performed best compared to other classifiers.

Fig. 2. Impact of relation types on cardiology disease prediction

We notice that most of the concept relation types have a positive impact on disease prediction accuracy, and that expanding the concepts with all types of medical relations (labelled “all”) performs best, providing the highest accuracy. The positive relations are “interprets” and “has-episodicity”. The negative impact of some relation types could be due to a high relatedness in symptom descriptions between different diseases. When using the relations “synonyms”, “same as” or “replace”, the overlapping features between diseases that present a few common symptoms increase, which makes the disease type hard to identify. For example, the symptoms “Breathless” and “Palpitation” are common to 16 and 12 cardiology diseases, respectively, out of a total of 21 diseases.

ACKNOWLEDGEMENTS

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and Mathematics of Information Technology and Complex Systems (MITACS). We also thank the reviewers for their valuable comments on this paper.

REFERENCES

[1] A. R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In AMIA Annual Symposium, pages 17–21, 2001.
[2] A. Babashzadeh, J. Huang, and M. Daoud. Exploiting semantics for improving clinical information retrieval. In SIGIR, pages 801–804, 2013.
 2013 IEEE International Conference on Bioinformatics and Biomedicine 978-1-4799-1310-7/13/$31.00 ©2013 IEEE