A Time Series Classification Method for Behaviour-Based Dropout Prediction
Haiyang Liu, Zhihai Wang
School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
{haiyangliu, zhhwang}@bjtu.edu.cn
Phillip Benachour, Philip Tubman
School of Computing and Communications, Lancaster University, Lancaster, UK
{p.benachour, p.tubman}@lancaster.ac.uk
Abstract: Students’ dropout rate is a key metric in online distance learning courses such as MOOCs. We propose a time series classification method that constructs data from students’ behaviour and activities on a number of online distance learning modules. Further, we propose a dropout prediction model based on the time series forest (TSF) classification algorithm. The proposed predictive model is based on interaction data and is independent of learning objectives and subject domains. The model enables prediction of dropout without requiring pedagogical experts. Results show that the prediction accuracy on two selected datasets increases as the portion of data used in the model grows. However, a reasonable prediction accuracy of 0.84 is possible with only 5% of the dataset processed. As a result, early prediction can help instructors design interventions to encourage course completion before a student falls too far behind.
Keywords: online distance learning; MOOCs; dropout prediction; time series; student interaction and behaviour.
I. INTRODUCTION
The rapid emergence of Massive Open Online Courses (MOOCs) has had a significant impact on open education and has enabled higher education institutions and organisations to develop different models of course dissemination and learner participation. Further, MOOCs have generated interest from researchers in data analytics and education research, among other fields. Over 100 highly ranked academic institutions partner with MOOC platforms to provide free education [1].
MOOCs have become a source of big data as the number of MOOC students increases. A course that generates user data on a daily basis can accumulate millions of records in a few months [2]. Many higher education institutions and organisations make use of data analytics to provide indicators for policy makers and practitioners, as well as valuable insights for teachers. Researchers from emerging educational fields such as learning analytics and educational data mining attempt to make sense of the huge datasets from MOOC providers, e.g., Coursera, edX, and FutureLearn. These large datasets provide an opportunity to detect differences in user behaviour which can be correlated with students’ performance.
We notice that there are two main differences between MOOCs and traditional courses. First, unlike students in traditional courses, students enrolled on MOOCs often show a much wider range of goals and engagement styles, and many lack the motivation to complete the course. Consequently, MOOCs tend to show a very high dropout rate [3]. This in turn motivates researchers to try to understand the reasons for the high dropout rates; hence retention prediction is an important aspect of a MOOC environment. Early prediction can help instructors design interventions to encourage course completion before a student falls too far behind [4]. A second difference is that universities disseminating MOOC courses are less likely to collect detailed information about their students, e.g., demographics, residency, and previous academic achievements. As a result, students’ interaction behaviour with the learning platform is the only source of data available from which to form a predictive model until course examinations have been completed [5].
Any form of sequential data in daily life can be thought of as time series data. Time series data can be found in a wide variety of scenarios such as finance, medicine, and agriculture, as well as MOOC platforms.
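As a minimal sketch of how timestamped platform activity becomes such a sequence, the following Python snippet bins hypothetical interaction events into per-day counts for each learner. The event tuples, the learner identifiers, and the function name daily_activity_series are illustrative assumptions, not the datasets or the preprocessing used in this paper.

```python
from collections import defaultdict
from datetime import date


def daily_activity_series(events, start, n_days):
    """Bin (learner_id, date) events into fixed-length per-day count series."""
    series = defaultdict(lambda: [0] * n_days)
    for learner_id, day in events:
        offset = (day - start).days
        if 0 <= offset < n_days:  # ignore events outside the observation window
            series[learner_id][offset] += 1
    return dict(series)


# Hypothetical events: (learner_id, date of one interaction with the platform)
events = [
    ("s1", date(2024, 1, 1)),
    ("s1", date(2024, 1, 1)),
    ("s1", date(2024, 1, 3)),
    ("s2", date(2024, 1, 2)),
]

series = daily_activity_series(events, start=date(2024, 1, 1), n_days=4)
```

Each resulting list is ordered by day, so two learners with the same total number of interactions can still produce different series. Learners with no events in the window do not appear in the result in this sketch.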
Because interaction data (e.g., clickstream data) between learners and the resources provided by MOOCs are collected over time, they can be seen as time series data, which makes it possible to apply time series data mining techniques to data analysis in MOOCs. Time series classification (TSC) problems differ from traditional classification problems because the attributes are ordered. The important characteristic is that there may be discriminatory features dependent on the ordering [6]. In this way, time series classification algorithms can work as powerful tools to reveal the learners’ interaction patterns that correlate with the probability of dropout.
This paper focuses on developing a dropout prediction model based on students’ behaviour and MOOC interaction data using a time series classification method. The authors use interaction data because demographic data of learners is sometimes not fully collected in