Software Cloning Detection Techniques: Comparison Criteria

A. Baqais 1, M. Ahmed 2
1 College of Computer Science and Engineering, KFUPM, Dhahran, Saudi Arabia
2 College of Computer Science and Engineering, KFUPM, Dhahran, Saudi Arabia

Abstract - Code cloning is an increasingly common practice among programmers, especially when they face a tight schedule. Cloned code is difficult to detect because programmers intentionally modify parts of the copied segments, which can lead to hard-to-track bugs and higher maintenance costs. Different approaches and techniques have been proposed in the literature to address this issue. However, these approaches either target one specific aspect of the issue or are biased toward some criteria over others. This paper presents a comparison study of these techniques based on a set of criteria, favoring approaches borrowed from the discipline of artificial intelligence. A final analysis lays the foundation for a newly proposed solution intended to outperform previous approaches.

Keywords: Software cloning, Code duplication, Code cloning, Code similarity.

1 Introduction

Software cloning is an active research area in the field of software engineering. A considerable number of papers have been published that address the issue from different perspectives. Some papers discuss the implications and consequences of software cloning, while others devote large sections to devising techniques for identifying cloned fragments. Moreover, some papers provide a framework for comparing the different techniques, tools and approaches targeting this domain.

Though it is considered bad practice, code duplication is quite popular in industrial software, for several reasons. Under the pressure of meeting deadlines, many programmers opt to copy snippets of code and paste them elsewhere in their program.
Another reason is that the original code may already have been fully tested and validated, so some developers deliberately prefer to duplicate it, especially when the segment implements advanced algorithms that handle many branches and computations. A third reason lies in the skills and capability of the development team: fresh or junior programmers tend to duplicate a method or class when they feel they lack the programming skills to write it themselves. Moreover, some code sections are not duplicated intentionally at all. Similar constructs across different programming languages, or the accidental duplication of functionality, can lead software cloning tools to report them as duplicates. For example, two for loops could be detected as a cloned segment even though they compute two different functions. Figure 1 illustrates this with two functions that calculate the area of 10 objects and the square of 10 numbers. A clone detector will flag them as clone candidates because they have almost the same number of lines and the same iteration variables; they differ only in the name of the returned variable. The literature reports this as a false positive, that is, code fragments that look similar but are semantically different and cannot be classified as clones.

Researchers show strong interest in studying duplicated code because it helps in refactoring, evaluating code quality and revealing hidden bugs. Refactoring refers to restructuring code without altering its intended behavior, which is conceptually similar to software cloning, where different pieces of code perform the same function with a different structure. Code duplication is a strong indication of a design flaw and affects code quality, since it hinders the use of other design techniques (such as abstraction or inheritance). In addition, duplicated code exhibits the same errors as its original.
Hence, a bug in the original code is carried over to every duplicate. For example, CP-Miner uncovered 28 bugs in Linux and 23 in FreeBSD [14]. It has also been asserted that cloning increases maintenance time.

Readability [1] is another interesting aspect of studying software cloning. One researcher says [2] that duplicating code makes it harder to understand, while another view holds that duplication helps in understanding the system because it provides sufficient information about the domain. These two views do not contradict each other. The practice of code duplication reduces the readability of the program per se; however, it gives information about the system as a whole, since it points to the important segments of code where duplication occurs most frequently.
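
The false-positive case described earlier in this introduction (two loops that look alike but compute different things) can be made concrete with a small sketch. Figure 1 is not reproduced here, so the two functions below are a hypothetical pair written in its spirit. The function names, the identifier and literal normalization scheme, and the 0.7 threshold are all assumptions made for illustration; they are not taken from this paper or from any specific clone detection tool.

```python
import difflib
import io
import keyword
import tokenize

# Hypothetical stand-ins for the two functions the text attributes to Figure 1.
AREA_SRC = """\
def total_area(widths, heights):
    area = 0
    for i in range(10):
        area += widths[i] * heights[i]
    return area
"""

SQUARE_SRC = """\
def sum_of_squares(numbers):
    square = 0
    for i in range(10):
        square += numbers[i] * numbers[i]
    return square
"""

# Assumed cut-off for reporting a clone candidate (illustrative only).
THRESHOLD = 0.7


def normalized_lines(source):
    """Return one token string per logical line, with names and numbers abstracted."""
    lines, current = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keep language keywords, replace every other identifier with "ID".
            current.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            current.append("NUM")
        elif tok.type == tokenize.OP:
            current.append(tok.string)
        elif tok.type == tokenize.NEWLINE and current:
            lines.append(" ".join(current))
            current = []
    return lines


def similarity(src_a, src_b):
    """Line-level similarity of the two normalized token streams, in [0, 1]."""
    matcher = difflib.SequenceMatcher(
        None, normalized_lines(src_a), normalized_lines(src_b)
    )
    return matcher.ratio()


def is_clone_candidate(src_a, src_b, threshold=THRESHOLD):
    """Flag the pair when the normalized similarity reaches the threshold."""
    return similarity(src_a, src_b) >= threshold


if __name__ == "__main__":
    print(normalized_lines(AREA_SRC))
    print(similarity(AREA_SRC, SQUARE_SRC))
    print(is_clone_candidate(AREA_SRC, SQUARE_SRC))
```

Once identifiers and literals are abstracted away, four of the five lines in each function are identical, so the pair scores 0.8 and is flagged as a clone candidate even though one function sums rectangle areas and the other sums squares. A purely lexical comparison of this kind cannot tell the two intents apart, which is why such pairs show up as false positives.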