A Machine Learning Approach for the Classification of Disease Risks in Time Series

Lejla Begic Fazlic*, Ahmed Hallawa†, Matthias Dziubany*, Marlies Morgen*, Jens Schneider*, Marvin Schacht*, Anke Schmeink†, Lukas Martin‡, Arne Peine‡, Thomas Vollmer§, Stefan Winter§ and Guido Dartmann*

* ISS, Trier University of Applied Sciences, Trier, Germany
† ISEK Research and Teaching Area, RWTH Aachen University, Aachen, Germany
§ Philips GmbH Innovative Technologies, Aachen, Germany
‡ Department of Intensive and Intermediate Care, University Hospital Aachen, Aachen, Germany

Abstract: In this work, a new hybrid algorithm for disease risk classification is proposed. The proposed methodology is based on Dynamic Time Warping (DTW). This methodology can be applied to time series from various domains, such as the vital sign time series available in medical big data. To validate our methodology, we applied it to risk classification for sepsis, which is one of the most challenging problems within the area of medical data analysis. In the first step, the algorithm uses different statistical properties of time series features. Furthermore, using differently labeled training data sets, we created a DTW Barycenter Averaging (DBA) on each feature. In the second step, validation data sets and DTW are used to validate the precision of classification, and the final results are compared. The performance of our methodology is validated with real medical data and on six different criteria definitions for sepsis. Results show that our algorithm performed, in the best case, with precision and recall of 96.38% and 90.90%, respectively.

Index Terms: Machine Learning, Time Series, Dynamic Time Warping, Data Mining, Sepsis

I. INTRODUCTION

Time series data extracted from electronic health records play an important role in improving medical care.
Using different statistical methods, these retrospective data have been used to understand the relationship between inputs and outcomes or to find similar patterns for specific patient groups. If an accurate diagnosis is provided at the right time, the appropriate treatment can be given and the patient has the best chance of a positive health outcome. Since early treatment of sepsis increases the chance of positive outcomes, a rapid diagnosis is crucial. As extracting and labeling sepsis data is not a trivial task, clinical risk prediction is very complex and depends on experience, the choice of criteria, the time of prediction and the prediction horizon. In this research, we developed a novel methodology for disease risk classification using retrospective data [1], [2]. The algorithm is based on the principles of DBA, DTW and additional statistical methods. In the training phase, we merged the feature data of all patients by creating a DBA for the positively labeled and for the negatively labeled data. In the validation phase, we used validation data to validate the precision of the classification. If the sample "is positive and it is classified as positive, it is counted as a true positive (TP); if it is classified as negative, it is considered as a false negative (FN)" [3]. If the sample "is negative and it is classified as negative it is considered as true negative (TN); if it is classified as positive, it is counted as false positive (FP)" [3]. Recall of a classifier "represents the positive correctly classified samples to the total number of positive samples" [3]. Precision "represents the proportion of positive samples that were correctly classified to the total number of positive predicted samples" [3]. In the final step, the precision and recall results for the differently labeled data are compared. We also show the high impact of the different formulations of the disease criteria (differently labeled data) on performance.
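To make the validation step concrete, the following minimal Python sketch shows how a single-feature time series could be assigned to the nearer of two templates by DTW distance, and how precision and recall follow from the TP, FP and FN counts defined above. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the absolute-difference local cost, the strict "closer to the positive template" decision rule and the hand-written templates are all hypothetical, and in the proposed methodology the templates would be obtained with DBA rather than written by hand.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook DTW distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference of two samples.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched again
                cost[i][j - 1],      # b[j-1] is matched again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def classify(series, positive_template, negative_template):
    """Return True if the series is strictly closer to the positive template.

    Assumption: ties are resolved as negative; the paper does not state a rule.
    """
    d_pos = dtw_distance(series, positive_template)
    d_neg = dtw_distance(series, negative_template)
    return d_pos < d_neg


def precision_recall(y_true, y_pred):
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical templates standing in for DBA averages of one feature.
    positive_template = [1.0, 2.0, 3.0]
    negative_template = [3.0, 2.0, 1.0]
    validation = [
        ([1.0, 2.0, 3.0, 3.0], True),
        ([3.0, 3.0, 2.0, 1.0], False),
    ]
    y_true = [label for _, label in validation]
    y_pred = [classify(s, positive_template, negative_template)
              for s, _ in validation]
    print(precision_recall(y_true, y_pred))
```

With several features per patient, one such distance would be computed per feature, and the per-feature results would have to be combined; how that combination is done is outside the scope of this sketch.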
The paper is organized as follows: In Section II, we describe related problems and the clinical challenge of sepsis identification. Furthermore, we review existing research studies that combine statistical and Machine Learning (ML) approaches. In Section III, we explain the algorithm design and methods used in our research, where we describe the proposed methodology including data acquisition and pre-processing. Section IV proposes our methodology, where we describe the algorithm design architecture. Section V illustrates and compares the numerical results for different training data according to the different criteria for sepsis. Finally, a conclusion is given in Section VI.

II. RELATED WORKS

The mortality rates of some diseases, such as heart disease or sepsis [4], are very high worldwide, so risk prediction plays a very important role. Their diagnosis requires a lot of experience, time and knowledge. For example, the authors of [5] evaluated the relative validity of sepsis identification criteria in a large database of intensive care unit patients. The monitoring and clinical challenge of sepsis identification is also presented in [6]. The authors of [7] developed a novel traumatic sepsis score (TSS) whose validation results allow a reliable prediction of the sepsis risk. Furthermore, they constructed a model using logistic regression based on a LASSO analysis. The authors of [8] used a statistical approach in which they derived and internally validated a sepsis risk score to predict future sepsis events. Using recorded vital signs and results of lab values from blood tests, C-statistic models and software-aided risk scores for