Using CBR techniques to detect plagiarism in computing assignments

Pádraig Cunningham
Department of Computer Science, Trinity College Dublin

Alexander N. Mikoyan
Dept. of Mathematics and Theoretical Mechanics, Moscow State University

Abstract. The problems of case retrieval in CBR and plagiarism detection have in common a need to detect close but not exact matches between exemplars. In this paper we describe a plagiarism detection system that has been inspired by ideas from CBR research. In particular this system can detect similarities between programs without performing exhaustive comparisons on all exemplars. Our analysis of similarity in this well controlled domain offers some insights into the kinds of profiles that can be used in similarity assessment in general. We argue that the choice of a perspicuous profile is crucial to any classification task and determining the best predictive features may require significant analysis of the problem domain.

1 Introduction

The problem of detecting plagiarism in computing assignments depends on being able to identify similar programs in large populations. This emphasis on similarity, on identifying close matches, is reminiscent of the problem of case retrieval in CBR. In this paper we will concentrate on the application of CBR techniques in Cogger*, a system for detecting plagiarism. We will discuss what this novel domain informs us about retrieval in CBR and about the automatic assessment of similarity in general. Our considerations on similarity in this well controlled domain offer some insights into the alternatives of statistical and knowledge based classification.

Since the idea of similarity can be considered along several dimensions it is often difficult for humans to agree on when cases, or programming assignments, are similar. In this research the programs under consideration have complicated structure and programs are considered to be similar if their function call structure is similar.
This involves the determination of the similarity of function call trees; the mechanisms we use are described in Appendix I.

Before examining the problem of plagiarism from a CBR perspective we will introduce some theoretical issues in CBR that are relevant. In section 3 we discuss similarity in general and in section 4 we consider the issue of problem representation, which must be addressed before any similarity can be determined. We believe that a basic tenet of the majority of CBR research is that similar cases can be retrieved from the case-base inexpensively; in section 5 we consider what kinds of representations are required to support this.

2 Theoretical Issues

Currently in AI there is a view that knowledge representation is unsuccessful and knowledge acquisition is fraught with problems. Consequently there is a move towards an AI paradigm that avoids these issues. This new AI is based on statistics and weights rather than symbolic knowledge representation [1]. The current popularity of connectionism is evidence of this. Closer to CBR, Memory-Based Reasoning (MBR) is an approach to AI that seeks to avoid knowledge acquisition and domain modelling [2]. The great attraction of neural networks and MBR is the contention that expert performance can be achieved without knowledge level analysis of the problem domain. This is in sharp contrast with the conventional view in AI: the view that "In the knowledge lies the power" and that the knowledge must be represented explicitly.

CBR is a methodology that can serve both of these paradigms. Case-Based Reasoning systems can be information theoretic or knowledge-based. CBR systems for simple tasks like diagnosis or property valuation can be set up with little analysis of the problem domain. At the other end of the spectrum, systems for more complex tasks like design require a complex domain model in order to process retrieved cases.
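As an illustration of the knowledge-light end of this spectrum, the following minimal sketch retrieves the nearest cases from a small case-base using a weighted distance over shallow features. The property features, prices and weights, and the names Case, distance and retrieve, are hypothetical and chosen for illustration only; the sketch is not part of Cogger.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Case:
    """A stored case: shallow index features plus the known outcome."""
    name: str
    features: tuple[float, ...]
    outcome: float


def distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def retrieve(case_base, query, weights, k=1):
    """Return the k cases closest to the query (exhaustive scan)."""
    ranked = sorted(case_base, key=lambda c: distance(c.features, query, weights))
    return ranked[:k]


# Hypothetical property-valuation cases (illustrative values only):
# features = (bedrooms, floor area in 100 m^2, distance to centre in 10 km)
CASE_BASE = [
    Case("A", (2.0, 0.8, 0.5), 180.0),
    Case("B", (3.0, 1.1, 0.3), 240.0),
    Case("C", (4.0, 1.6, 1.2), 260.0),
]

# Uniform weights: no domain analysis has gone into choosing them.
WEIGHTS = (1.0, 1.0, 1.0)

nearest = retrieve(CASE_BASE, (3.0, 1.0, 0.4), WEIGHTS, k=1)
print(nearest[0].name, nearest[0].outcome)
```

Note that this sketch compares the query against every stored case, and that the choice of features and weights is the only place where knowledge of the domain enters.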
The main theme of this paper is the implications that these issues have for determining similarity in case retrieval. Is it possible to establish the similarity of two cases in a system that does not have a strong domain model? How far can we go with shallow index features in case retrieval? To this end we will analyse similarity in the context of detecting plagiarism in computing assignments. This is not really a CBR problem but we will argue that the issues of similarity are the same nonetheless.

* "Cogging" is an Anglo-Irish slang word for copying homework or other exercises.