Fraud Detection in Statistics Education based on the Compendium Platform and Reproducible Computing

Patrick Wessa
K.U.Leuven Association
Lessius Dept. of Business Studies
Belgium

Bart Baesens
K.U.Leuven Association
Faculty of Business and Economics
Belgium

Abstract

This paper focuses on a newly developed method to detect fraud in empirical papers that are submitted by students. The proposed solution is based on the Compendium Platform and Reproducible Computing ([21], [18], [17], [20]), which allows the educator to build e-learning environments that are embedded in the pedagogical framework of social constructivism ([16], [15], [12]) and which can be shown to be effective in terms of non-rote learning of statistical concepts [19].

The paper addresses the technological aspects of the proposed fraud detection system, ways to discriminate between various types of fraud (plagiarism, free riding, data tampering, peer-review cheating), and the pedagogical issues that result from its implementation (responsibility, non-rote learning). Finally, the first experiences with the implementation of the proposed technology in three undergraduate statistics courses (with large student populations) are discussed and used to recommend paths for future research & development.

Acknowledgments

This research was funded by the OOF2007/13 project of the K.U.Leuven Association. I would like to thank Ed van Stee for his excellent work in programming substantial parts of the Compendium Platform.

1. Introduction

In a recent editorial of the journal Research Policy, the problem of plagiarism and the inability of peer review to detect it was clearly illustrated and summarized as follows [4]:

The fact that academic misconduct on this scale has gone unchecked over such a prolonged period raises serious issues about the efficacy of the processes used to police the conduct of researchers. ...
a measured degree of vigilance and a greater willingness to pursue any well-founded suspicions of research misconduct are required by editors, referees, publishers and the wider academic community if the scourge of plagiarism is to be kept at bay.

If this is true for the research community, then the problem of fraud detection in education is not only relevant but also very challenging. In particular, we believe that it is difficult to detect fraudulent activities that are related to statistical analysis for a variety of reasons, such as the following:

• the data under investigation might not be readily available
• the software that is needed to verify the analysis might not be available
• the computation might not be reproducible because it is obfuscated (for instance when the underlying computational parameters are not explicitly defined)

The difficulties that we encounter in detecting statistical fraud are therefore closely related to the problem of irreproducible research as described in [6] and [2]. Many solutions have been proposed ([3], [13], [14], [1], [5], [8], [9]) but were not implemented in statistics education due to a series of practical and technical reasons [20].

With the introduction of our newly developed Reproducible Computing technology (which is hosted within the so-called Compendium Platform and which was built upon the R Framework for statistical computing [17]), these problems have been solved [20], [19]. In addition, it is now possible to accurately measure the actual, rather than self-reported, learning activities that are related to statistical computing [20]. This is not only important to gain a better understanding of learning processes and their relationships with computing and learning technology; it is also a conditio sine qua non for improving fraud detection and prevention, as will be illustrated in the following sections.
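
The three obstacles listed in this section (unavailable data, unavailable software, undefined computational parameters) can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions and is not the Compendium Platform's actual implementation or API: the names `trimmed_mean`, `make_record` and `verify` are hypothetical. It bundles the data, the explicit parameters and the result of a computation into one record, so that a reviewer can recompute the result and notice a reported value that does not follow from the stored inputs.

```python
import hashlib
import json
import statistics


def trimmed_mean(data, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(data)
    kept = ordered[trim:len(ordered) - trim]
    return statistics.fmean(kept)


def make_record(data, trim):
    """Bundle data, explicit parameters and result into one record (hypothetical format)."""
    result = trimmed_mean(data, trim)
    payload = {"data": data, "params": {"trim": trim}, "result": result}
    # Hash a canonical JSON form so that identical records share a fingerprint.
    canonical = json.dumps(payload, sort_keys=True)
    payload["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return payload


def verify(record):
    """Recompute the result from the stored data and parameters and compare."""
    recomputed = trimmed_mean(record["data"], record["params"]["trim"])
    return recomputed == record["result"]
```

For example, `make_record([2.0, 4.0, 6.0, 100.0], 1)` stores the trimming parameter next to the data, and `verify` returns `False` if the stored result is later altered. Under these assumptions, the fingerprint gives a cheap way to spot identical submissions, although a real system would also need to handle near-duplicates.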