A Machine Learning Approach for the Classification of Disease Risks in Time Series

Lejla Begic Fazlic*, Ahmed Hallawa, Matthias Dziubany*, Marlies Morgen*, Jens Schneider*, Marvin Schacht*, Anke Schmeink, Lukas Martin, Arne Peine, Thomas Vollmer§, Stefan Winter§ and Guido Dartmann*

* ISS, Trier University of Applied Sciences, Trier, Germany
ISEK Research and Teaching Area, RWTH Aachen University, Aachen, Germany
§ Philips GmbH Innovative Technologies, Aachen, Germany
Department of Intensive and Intermediate Care, University Hospital Aachen, Aachen, Germany

Abstract—In this work, a new hybrid algorithm for disease risk classification is proposed. The proposed methodology is based on Dynamic Time Warping (DTW) and can be applied to time series from various domains, such as the vital sign time series available in medical big data. To validate our methodology, we applied it to risk classification for sepsis, which is one of the most challenging problems in medical data analysis. In the first step, the algorithm uses different statistical properties of the time series features. Furthermore, using differently labeled training data sets, we computed a DTW Barycenter Averaging (DBA) sequence for each feature. In the second step, validation data sets and DTW are used to validate the precision of the classification, and the final results are compared. The performance of our methodology is validated with real medical data and on six different criteria definitions for sepsis. Results show that our algorithm performed, in the best case, with precision and recall of 96.38% and 90.90%, respectively.

Index Terms—Machine Learning, Time Series, Dynamic Time Warping, Data Mining, Sepsis

I. INTRODUCTION

Time series data extracted from electronic health records play an important role in improving medical care.
Using different statistical methods, these retrospective data have been used to understand the relationship between inputs and outcomes or to find similar patterns for specific patient groups. If an accurate diagnosis is made at the right time, the appropriate treatment can be provided and the patient has the best chance of a positive health outcome. Since early treatment of sepsis increases the chance of a positive outcome, a rapid diagnosis is crucial. As extracting and labeling sepsis data is not a trivial task, clinical risk prediction is very complex and depends on experience, the choice of criteria, the time of prediction and the prediction horizon.

In this research, we developed a novel methodology for disease risk classification using retrospective data [1], [2]. The algorithm is based on the principles of DBA, DTW and additional statistical methods. In the training phase, we merged the feature data of all patients by computing a DBA for the positively labeled and for the negatively labeled data. In the validation phase, we used validation data to validate the precision of the classification. If the sample "is positive and it is classified as positive, it is counted as a true positive (TP); if it is classified as negative, it is considered as a false negative (FN)" [3]. If the sample "is negative and it is classified as negative it is considered as true negative (TN); if it is classified as positive, it is counted as false positive (FP)" [3]. Recall of a classifier "represents the positive correctly classified samples to the total number of positive samples" [3]. Precision "represents the proportion of positive samples that were correctly classified to the total number of positive predicted samples" [3]. In the final step, the precision and recall results for the differently labeled data are compared. We also show that the formulation of the disease criteria (that is, how the data are labeled) has a high impact on performance.
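The validation step described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy series and the decision rule (assign a sample to the class whose reference sequence is closer in DTW distance) are assumptions made for illustration, and the two reference sequences merely stand in for the DBA sequences computed during training. Precision and recall follow the definitions quoted from [3], i.e. TP / (TP + FP) and TP / (TP + FN).

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with |x - y| as the local cost."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(series, positive_ref, negative_ref):
    """Assumed rule: positive if the series is closer to the positive reference."""
    return dtw_distance(series, positive_ref) < dtw_distance(series, negative_ref)


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical reference sequences standing in for the two DBA sequences.
positive_ref = [0, 1, 3, 5, 6]
negative_ref = [0, 0, 0, 0, 0]

# Hypothetical validation samples as (series, true label) pairs.
validation = [
    ([0, 1, 2, 4, 6, 6], True),
    ([0, 0, 1, 0, 0], False),
    ([0, 0, 0, 1, 1], True),
    ([1, 2, 4, 5, 5], False),
    ([0, 2, 3, 5, 6], True),
]

y_true = [label for _, label in validation]
y_pred = [classify(s, positive_ref, negative_ref) for s, _ in validation]
precision, recall = precision_recall(y_true, y_pred)
```

Changing the labels of the validation samples while keeping the series fixed changes the TP, FP and FN counts, which is one way to see why differently labeled data lead to different precision and recall values.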
The paper is organized as follows: In Section II, we describe related problems and the clinical challenge of sepsis identification. Furthermore, we review existing research studies that combine statistical and Machine Learning (ML) approaches. Section III explains the methods used in our research, including data acquisition and pre-processing. Section IV presents our methodology and describes the architecture of the algorithm. Section V illustrates and compares the numerical results for different training data according to the different criteria for sepsis. Finally, a conclusion is given in Section VI.

II. RELATED WORKS

The mortality rates of some diseases, for example heart diseases or sepsis [4], are very high worldwide, so risk prediction plays a very important role. Their diagnosis requires a lot of experience, time and knowledge. For example, the authors of [5] evaluated the relative validity of sepsis identification criteria in a large database of intensive care unit patients. The monitoring and clinical challenge of sepsis identification is also discussed in [6]. The authors of [7] developed a novel traumatic sepsis score (TSS) whose validation results allow a reliable prediction of the sepsis risk. Furthermore, they constructed a model using logistic regression based on a LASSO analysis. The authors of [8] used a statistical approach in which they derived and internally validated a sepsis risk score to predict future sepsis events. Using recorded vital signs and results of lab values from blood tests, C-statistic models and software-aided risk scores for