Discovery of Student Strategies using Hidden Markov Model Clustering

Benjamin Shih
Machine Learning Department
Carnegie-Mellon University
Pittsburgh, PA 15213
shih@cmu.edu

Kenneth R. Koedinger
HCI Institute
Carnegie-Mellon University
Pittsburgh, PA 15213
koedinger@cmu.edu

Richard Scheines
Department of Philosophy
Carnegie-Mellon University
Pittsburgh, PA 15213
scheines@cmu.edu

Abstract

Students interacting with educational software generate data on their use of software assistance and on the correctness of their answers. This data comes in the form of a time series, with each interaction as a separate data point. This data poses a number of unique issues. In educational research, results should be interpretable by domain experts, which strongly biases learning towards simpler models. Educational data also has a temporal dimension that is generally not fully utilized. Finally, when educational data is analyzed using machine learning techniques, the algorithm is generally off-the-shelf with little consideration for the unique properties of educational data. We focus on the problem of analyzing student interactions with software tutors. Our objective is to discover different strategies that students employ and to use those strategies to predict learning outcomes. For this, we utilize hidden Markov model (HMM) clustering. Unlike some other approaches, HMMs incorporate the time dimension into the model. By learning many HMMs rather than just one, the result will include smaller, more interpretable models. Finally, as part of this process, we can examine different model selection criteria with respect to the models' predictions of student learning outcomes. This allows further insight into the properties of model selection criteria on educational data sets, beyond the usual cross-validation or test analysis. We discover that the algorithm is effective across multiple measures and that the adjusted R² is an effective model selection metric.
1 Introduction

Educational software is an increasingly important part of human education. Many schools use educational software as a major component in classroom curricula and individuals are using specialized software for diverse purposes such as second-language acquisition and extracurricular tutoring.

Likewise, the analysis of data from educational software is also a growing field. Individuals interacting with an educational system generate sizable quantities of time-stamped data, ranging in granularity from individual mouse movements to attempted solutions. This data offers insight into an individual's underlying cognitive processes and has the potential to guide future educational interventions. However, the temporal-sequential aspect of educational data is frequently underutilized.

In brief, the usual approach to analyzing educational data is to compute a set of features, e.g. average number of attempts, and to then input those features into an off-the-shelf machine learning algorithm in an attempt to predict learning between separately administered pre-tests and post-tests. These features usually do not incorporate a significant temporal aspect aside from the student's response time, i.e. the time between a stimulus, such as a problem statement, and the response, such as a solution