Identification of High-Level Concept Clones in Source Code

Andrian Marcus, Jonathan I. Maletic
Department of Computer Science
Kent State University
Kent, Ohio 44242
amarcus@cs.kent.edu, jmaletic@cs.kent.edu

Abstract

Source code duplication occurs frequently within large software systems. Pieces of source code, functions, and data types are often duplicated in part, or in whole, for a variety of reasons. Programmers may simply be reusing a piece of code via copy and paste or they may be “re-inventing the wheel”. Previous research on the detection of clones has mainly focused on identifying pieces of code with similar (or nearly similar) structure. Our approach is to examine the source code text (comments and identifiers) and identify implementations of similar high-level concepts (e.g., abstract data types). The approach uses an information retrieval technique (i.e., latent semantic indexing) to statically analyze the software system and determine semantic similarities between source code documents (i.e., functions, files, or code segments). These similarity measures are used to drive the clone detection process. The intention of our approach is to enhance and augment existing clone detection methods that are based on structural analysis. This synergistic use of methods will improve the quality of clone detection. A set of experiments is presented that demonstrates the use of semantic similarity measures to identify clones within a version of NCSA Mosaic.

1. Introduction

Research suggests [3, 23] that large software systems contain a non-trivial amount of duplicated source code. There are a number of reasons for the existence of these duplicate implementations, or clones. For one, programmers often perform a type of ad hoc reuse by using the copy and paste method. The scenario is common: you find a piece of code in another routine that almost solves your problem. You copy it to your routine and modify it to suit the problem at hand.
This type of “reuse” is less costly (at the time) than redesigning a larger part of the system to incorporate the necessary generality of the reused piece of code. Ideally, the programmer would create a more general set of routines or design a class hierarchy to solve the reuse problem. This represents a programmer’s explicit intent to reuse an abstraction in the problem, or solution, domain. Baxter et al. go so far as to say that we should offer tools to support this type of cloning (reuse) in a more structured and well-defined manner.

The situation described above gives rise to the following types of clones. A (perfect) clone is a program fragment that is identical to another program fragment. A near miss clone is a program fragment that is very similar to another fragment. The near miss clone comes about when the programmer modifies the copied fragment.

Another reason for the occurrence of clones, especially in very large software systems, is “re-inventing the wheel”. A developer (or maintainer) may not know of the existence of a solution to their problem and simply solve it by developing new code. Alternatively, they may know of a fragment that is similar to what they need, but feel that the expense of understanding and modifying the fragment is too great in comparison to “writing it themselves”. Re-inventing the wheel gives rise to near miss clones and possibly “wide miss” clones. A wide miss clone solves the same (or nearly the same) problem but has a very different structure. While this general problem could be solved by better designs, communication among developers, or better documentation, it remains a reality.

From our experience, these types of clones often manifest themselves as higher-level abstractions in the problem or solution domain. A simple example is the list abstract data type (ADT). A list structure is often duplicated in one form or another throughout a system. Each programmer, or team, builds one to suit his or her particular needs.
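The list example can be made concrete with a small sketch. The three C fragments below are hypothetical and are not taken from Mosaic: two implement a list with very different structure (a wide miss clone), and one is unrelated. The similarity measure is a plain term-frequency cosine over the words found in identifiers and comments. It is a deliberately simplified stand-in for latent semantic indexing (no term-document matrix decomposition is performed), and the stop-word list and the helper names `terms` and `cosine` are assumptions made only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical C fragments: two list implementations with different
# structure, and one fragment that is unrelated to lists.
LINKED_LIST = """
/* append an item to the end of the linked list */
void list_append(List *list, Item *item) {
    Node *node = new_node(item);
    list->tail->next = node;
    list->tail = node;
}
"""

ARRAY_LIST = """
/* add item at the end of the list, growing the array if needed */
int ListAddItem(ItemList *l, void *item) {
    if (l->count == l->size) grow(l);
    l->items[l->count++] = item;
    return l->count;
}
"""

UNRELATED = """
/* parse the url and open a socket connection to the host */
int open_connection(char *url) {
    char *host = parse_host(url);
    return connect_socket(host, 80);
}
"""

# Small illustrative stop list: a few C keywords and English function words.
STOP = {"void", "int", "char", "if", "return",
        "the", "a", "an", "to", "of", "and", "at"}


def terms(source):
    """Count the words found in the identifiers and comments of a fragment."""
    counts = Counter()
    for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for part in word.split("_"):
            # Split camelCase parts such as ListAddItem into List, Add, Item.
            for piece in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part):
                term = piece.lower()
                if len(term) > 1 and term not in STOP:
                    counts[term] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


linked, array, other = terms(LINKED_LIST), terms(ARRAY_LIST), terms(UNRELATED)
print("linked vs. array list:", round(cosine(linked, array), 3))
print("linked vs. unrelated: ", round(cosine(linked, other), 3))
```

The two list fragments share no statements, yet they share vocabulary such as "list", "item", and "end", so they score higher against each other than against the unrelated fragment. This is the kind of signal that a purely structural comparison would not pick up.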
We term these types of clones high-level concept clones. While a number of the existing clone detection methods can detect some of these types of clones, no