ASSIST: EMPLOYING INFERENCE AND SEMANTIC TECHNOLOGIES TO FACILITATE ASSOCIATION STUDIES ON CERVICAL CANCER

P. Mitkas*, C. Maramis*, A. Delopoulos*, A. Symeonidis*, S. Diplaris*, M. Falelakis*, F. Psomopoulos*, A. Batzios*, N. Maglaveras**, I. Lekka**, V. Koutkias**, T. Agorastos**, T. Mikos** and A. Tatsis**
* Dept. Electrical and Computer Engineering, Aristotle University, Thessaloniki, Greece
** Medical School, Aristotle University, Thessaloniki, Greece
chmaramis@mug.ee.auth.gr

Abstract: Advances in biomedical engineering have lately facilitated medical data acquisition, leading to increased availability of both genetic and phenotypic patient data. In the area of cervical cancer in particular, intensive research investigates the role of specific genetic and environmental factors in determining the persistence of HPV, which is the primary causal factor of cervical cancer, and the subsequent progression of the disease. To this end, genetic association studies constitute a widely used scientific approach for medical research. However, despite the increased data availability worldwide, individual studies are often inconclusive due to the physical and conceptual isolation of the medical centers, which limits the pool of data actually available to each researcher. ASSIST, an EU-funded research project, aims at facilitating medical research on cervical cancer by tackling these data isolation issues. To accomplish that, it virtually unifies multiple patient record repositories that are physically located at different sites, and subsequently employs inferencing techniques on the unified medical knowledge to enable the execution of cervical cancer related association studies that comprise both genotypic and phenotypic study factors. This allows medical researchers to perform more complex and reliable association studies on larger, high-quality datasets.
Introduction

In recent years, advances in the area of biomedical engineering have allowed for more accurate and detailed data acquisition in the area of health care. This has led to an increase in the availability of patient data of both phenotypic and, most importantly, genotypic nature. Such data are nowadays produced in abundance by examinations that were once laborious, and they are used not only for diagnosis and treatment but also in medical research. However, despite the increased data availability, scientific progress is hindered by the fact that each medical center operates in relative isolation, both physical and conceptual. This means that the produced data not only reside in physically isolated repositories but are also stored in different knowledge representation forms, since there is no universally accepted knowledge representation standard for medical data acquisition, storage and labeling.

Cervical cancer (CxCa) is the second leading cause of cancer-related deaths after breast cancer for women between 20 and 39 years old [1] and one of the leading types of cancer affecting women worldwide. It has been proven that infection by the human papillomavirus (HPV) is a necessary condition for the disease [2]. However, since HPV infection is highly unlikely to be the sole cause for developing cancer, intensive ongoing research investigates the role of specific genetic and environmental factors in determining the persistence of HPV and the subsequent progression of the disease [3]. To this end, genetic association studies, i.e. studies that aim at detecting associations between one or more genetic variants and a trait (e.g. a disease) [4], constitute a widely used scientific approach in medical research. If a statistical correlation is observed between genotype and phenotype, an association between the variant and the trait is inferred [5].
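To make the notion of a statistical correlation between genotype and phenotype concrete, the following minimal sketch tests a single variant against case/control status with a Pearson chi-square test on a 2x2 contingency table. It is an illustration only and is not taken from ASSIST: the function names and the patient counts are hypothetical, and a real association study would typically also correct for multiple testing and confounding factors.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Table layout (all counts are numbers of patients):
                         cases   controls
        variant carrier    a        b
        non-carrier        c        d
    Returns the chi-square statistic and its p-value.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def odds_ratio(a, b, c, d):
    """Odds of being a case for carriers relative to non-carriers."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts, invented purely for illustration.
carriers_cases, carriers_controls = 30, 20
noncarriers_cases, noncarriers_controls = 20, 30

chi2, p = chi_square_2x2(carriers_cases, carriers_controls,
                         noncarriers_cases, noncarriers_controls)
ratio = odds_ratio(carriers_cases, carriers_controls,
                   noncarriers_cases, noncarriers_controls)
print(f"chi2={chi2:.2f}  p={p:.4f}  odds ratio={ratio:.2f}")
```

With these made-up counts the statistic is 4.0 and the p-value falls just below 0.05, so an association would be inferred at the conventional 5% level. With half as many patients in every cell the same proportions would no longer reach significance, which is why dataset size matters for such studies.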
The quality of the conclusions drawn from association studies depends heavily on the size of the available dataset. Low numbers of patient records lead to doubtful conclusions. This is why several studies are inconclusive: the datasets employed are small and of poor quality due to the isolation issues described above.

ASSIST (Association Studies aSsisted by Inference and Semantic Technologies) is an EU-funded research project that aims at facilitating medical research on CxCa by tackling these isolation issues at both the physical and the semantic level. ASSIST overcomes the problem of physical isolation of data sources by adopting a 3-tier architecture: researchers conducting association studies have access, through the single node of ASSIST, to all participating patient record repositories, which are physically located at different medical research centers and/or hospitals. This would be sufficient if the multiple repositories had the same internal schema, the same level of detail and the same terminology. However, this is not the case in practice. The lack of a common representation standard for CxCa-related data, the detail of relevant examination results, as well as