Associating Risks of Getting Strokes with Data from Health Checkup Records using Dempster-Shafer Theory

Sergio Peñafiel*, Nelson Baloian*, Jose A. Pino*, Jorge Quinteros*, Álvaro Riquelme*, Horacio Sanson**, Douglas Teoh**
*Department of Computer Science, Universidad de Chile, Santiago, Chile
**Allm Inc., Tokyo, Japan
spenafiel@dcc.uchile.cl, nbaloian@dcc.uchile.cl, jpino@dcc.uchile.cl, jorge.quinteros@ug.uchile.cl, alvaroriquelme@uchile.cl, horacio@allm.net, d.teoh@allm.net

Abstract—Prediction of future diseases from the historical data of medical patients is a topic that has gained increasing interest given the growing availability of such data in electronic format. Most of the developed systems are based on machine learning techniques, which are good at finding relations between data but do not help to explain causality. In particular, it would be difficult to obtain a meaningful medical explanation for the relationship between a patient's health checkup data and the risk of developing a certain disease. On the other hand, expert system approaches, like Bayesian networks, are based on medical knowledge but have trouble dealing with high levels of uncertainty, and handling that uncertainty is crucial in this kind of scenario. In this work we present a prediction system for the risk of a patient having a (heart or brain) stroke based on past medical checkup data. The system is based on the Dempster-Shafer Theory of plausibility, which is good for handling uncertainty. The data used belongs to a rural hospital in Okayama, Japan, where people are required by law to undergo annual health checkups. The model also produces rules that relate data from exam results to the aforementioned risk, thus proposing a cause from the medical point of view.
Experiments comparing the results of the Dempster-Shafer method with other machine learning methods, namely Multilayer perceptron, Quadratic discriminant analysis and Naive Bayes, show that our approach performed the best in general, with an overall prediction accuracy of 61% and the best precision value on true positive cases of stroke.

I. INTRODUCTION

Cardiovascular disease (CD), a non-communicable disease (ND) [1], [2], is the leading global cause of death, accounting for more than 17.3 million deaths per year in 2013, a number that is expected to grow to more than 23.6 million by 2030. In 2015, an estimated 7.4 million deaths worldwide were due to coronary heart disease and another 6.7 million were due to stroke. In 2010 the estimated global cost of CD was $863 billion, and it is estimated to rise to $1,044 billion by 2030. In Japan alone, around 55,000 people under 70 years of age died of CD in 2012 [3].

The World Health Organization's Global Strategy for the Prevention and Control of non-communicable Diseases (2013) recommends the evaluation of various strategies for early detection and screening of ND to promote research on prevention and control of these diseases [3]. Being aware of the risk of developing these illnesses in their early stages has proved to be beneficial not only for prevention, but also for the treatment of patients [2].

Many countries gather electronic records of patients' medical checkups over the years. These records usually include symptoms, examination results and the history of diseases of each patient. In the case of Japan, all active workers must undergo a yearly medical checkup to monitor their health, and the resulting data are also recorded electronically. This results in large databases of health information that can potentially be used in many ways to train classification or prediction models.
The problem of disease risk prediction has been studied through different approaches including statistical analysis [4], if-then rules [5], data mining [6] and machine learning [7]–[9]. In recent years, with the rise of new machine learning techniques and the growing amounts of medical data available, the problem has gained renewed interest from the data science community.

Generally speaking, there are two main approaches for designing predictive algorithms in a variety of domains. The expert system approach consists in "encoding" the knowledge of experts in the required field in an algorithm, using if-then rules, Bayesian networks or association rules (like the ones implemented in the Prolog programming language). The "machine learning" approach consists in algorithms that handle large amounts of data, searching for relations among them, such as neural networks, decision trees, genetic algorithms, particle swarm optimization and multilinear regressions.

Both approaches have pros and cons. Expert systems need the collaboration of experts, who very frequently cannot structure their knowledge, which makes it difficult to code that knowledge into an algorithm. It is also difficult to model some of the uncertainties experts have in this kind of system [10]. A very important advantage of expert systems is that they make it possible to understand and check why a certain prediction is made, which in the medical scenario is at least important, if not mandatory [11]. On the other hand, machine learning methods have the advantage that they can sometimes help discover relationships and causalities that were not previously known by experts.

International Conference on Advanced Communications Technology (ICACT), ISBN 979-11-88428-00-7, ICACT2018, February 11 ~ 14, 2018
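As a rough illustration of how the Dempster-Shafer theory named in the abstract can represent uncertainty explicitly, the following minimal sketch applies Dempster's rule of combination to two pieces of evidence. It is not the authors' model: the two "rules", their mass values, and the names `combine`, `belief` and `plausibility` are hypothetical and chosen only for this example. Mass assigned to the whole set {stroke, no_stroke} stands for "unknown", which is how uncertainty is kept separate from evidence for or against a stroke.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass that would fall on the empty set counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    # Renormalize so the remaining masses sum to 1.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def belief(m, hypothesis):
    """Sum of the masses of all subsets of the hypothesis."""
    return sum(v for s, v in m.items() if s <= hypothesis)


def plausibility(m, hypothesis):
    """Sum of the masses of all sets that intersect the hypothesis."""
    return sum(v for s, v in m.items() if s & hypothesis)


STROKE = frozenset({"stroke"})
NO_STROKE = frozenset({"no_stroke"})
EITHER = STROKE | NO_STROKE  # mass here means "uncertain"

# Hypothetical evidence from two checkup attributes (values are made up).
rule_blood_pressure = {STROKE: 0.4, EITHER: 0.6}
rule_age = {STROKE: 0.2, NO_STROKE: 0.3, EITHER: 0.5}

combined = combine(rule_blood_pressure, rule_age)
print("belief(stroke)       =", round(belief(combined, STROKE), 4))
print("plausibility(stroke) =", round(plausibility(combined, STROKE), 4))
```

The gap between belief and plausibility for a hypothesis reflects how much mass is still assigned to "uncertain" after the two sources are combined, which is the property the abstract refers to when it says the theory is good for handling uncertainty.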