NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization

Chanchal K. Roy and James R. Cordy
School of Computing, Queen’s University
Kingston, ON, Canada K7L 3N6
{croy, cordy}@cs.queensu.ca

Abstract

This paper examines the effectiveness of a new language-specific parser-based but lightweight clone detection approach. Exploiting a novel application of a source transformation system, the method accurately finds near-miss clones using an efficient text line comparison technique. The transformation system assists the method in three ways. First, using agile parsing it provides user-specified flexible pretty-printing to remove noise, standardize formatting and break program statements into parts such that potential changes can be detected as simple linewise text differences. Second, it provides efficient flexible extraction of potential clones to be compared using island grammars and agile parsing to select granularities and enumerate potential clones. Third, using transformation rules it provides flexible code normalization to allow for local editing differences between similar code segments and filtering out of uninteresting parts of potential clones. In this paper we introduce the theory and practice of the framework and demonstrate its use in finding function clones in C code. Early experiments indicate that the method is capable of finding near-miss clones with high precision and recall, and with reasonable performance.

1. Introduction

Copying a code fragment and reusing it by pasting with or without minor modifications is a common practice in software development environments. As a result, software systems often have sections of code that are similar, called software clones or code clones. Previous research shows that a significant amount of the code of a software system (between 7% and 23%) is cloned code [3, 5, 22, 26].
While programmers often practise cloning with clear intent [23] and it is beneficial in certain situations [21], one of the major difficulties with such duplicated fragments is that if a bug is detected in a code fragment, all the fragments similar to it should be investigated to check for the same bug [25]. Moreover, when enhancing or adapting a piece of code, duplicated fragments can multiply the work to be done [19].

From a program comprehension point of view, clones carry important domain knowledge and thus studying the clones in a system can assist in understanding it [19]. Moreover, by refactoring the clones detected, one can potentially improve understandability, maintainability and extensibility, and reduce the complexity of the system [15].

Fortunately, several (semi-)automated techniques for detecting code clones have been proposed (cf. Section 11). Several studies show that lightweight text-based techniques can find clones with high accuracy and confidence, but detected clones often do not correspond to appropriate syntactic units [7, 30]. Parser-based syntactic (AST-based) techniques, on the other hand, find syntactically meaningful clones but tend to be more heavyweight, requiring a full parser and subtree comparison method. Moreover, neither text-based nor parser-based techniques have been found to be effective in detecting near-miss clones [7].

In this paper, we propose a multi-pass approach which is parser-based and language-specific but reasonably lightweight, using simple text line rather than subtree comparison to achieve good time and space complexity. We exploit the benefits of TXL [9] to efficiently identify and extract potential syntactic clones with pretty-printing to eliminate formatting differences and noise.
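
The idea of comparing normalized, pretty-printed fragments line by line can be illustrated with a small sketch. This is not the TXL-based implementation described in this paper: it is a Python stand-in that assumes the fragments are already pretty-printed one statement part per line, uses a regular expression instead of a parser to rename identifiers, and uses the standard library's `difflib.SequenceMatcher` as the line comparison. The names `C_KEYWORDS`, `normalize_line`, `normalized_lines` and `line_similarity` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tiny keyword set, enough for the illustration only.
C_KEYWORDS = {"int", "char", "void", "float", "double",
              "return", "if", "else", "for", "while"}


def normalize_line(line: str) -> str:
    """Collapse whitespace and replace each non-keyword identifier with 'id'."""
    def rename(match: re.Match) -> str:
        word = match.group(0)
        return word if word in C_KEYWORDS else "id"

    collapsed = " ".join(line.split())
    return re.sub(r"[A-Za-z_]\w*", rename, collapsed)


def normalized_lines(fragment: str) -> list[str]:
    """Normalize every non-blank line of an already pretty-printed fragment."""
    return [normalize_line(l) for l in fragment.splitlines() if l.strip()]


def line_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed over whole lines, not characters."""
    la, lb = normalized_lines(a), normalized_lines(b)
    return SequenceMatcher(None, la, lb, autojunk=False).ratio()


original = "int sum(int a, int b)\n{\n    return a + b;\n}\n"
renamed = "int add(int x,   int y)\n{\n    return x + y;\n}\n"
edited = "int add(int x, int y)\n{\n    log(x);\n    return x + y;\n}\n"

exact_after_normalization = line_similarity(original, renamed)
near_miss = line_similarity(original, edited)
```

Here `renamed` differs from `original` only in identifier names and spacing, so the two are identical after normalization, while `edited` contains one additional line and scores below 1. Any cut-off for deciding that such a score indicates a near-miss clone would be a tuning choice and is not specified by this sketch.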
TXL’s agile parsing [11] allows us to flexibly select granularity, and to tune the pretty-printing of potential clones to introduce additional line breaks such that potential variances within statements and other structures can be accurately reflected using a simple text line comparison. TXL’s transformation rules allow us to add flexible code normalization and filtering of uninteresting or irrelevant sections in the potential clones, yielding accurate minimal differences that are easily traced back to original source using source coordinates.

Our approach is lightweight in the sense that, like other text-based techniques (e.g., Duploc [13]), we work directly on program source text. Although pretty-printing, code normalization and filtering all use TXL’s agile parsing and