Efficient Token Based Clone Detection with Flexible Tokenization

Hamid Abdul Basit
National University of Singapore, Department of Computer Science
3 Science Drive 2, Singapore 117543
(+65) 6516 1184
h.abdul.basit@gmail.com

Simon J. Puglisi
Curtin University of Technology, Department of Computing
puglissj@computing.edu.au

William F. Smyth
McMaster University, Department of Computing and Software
smyth@mcmaster.ca

Andrew Turpin
RMIT University, School of Computer Science and Information Technology
aht@cs.rmit.edu.au

Stan Jarzabek
National University of Singapore, Department of Computer Science
3 Science Drive 2, Singapore 117543
(+65) 6516 2863
stan@comp.nus.edu.sg

ABSTRACT
Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for the analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool, Repeated Tokens Finder (RTF). Instead of using a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based, linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism.
Analysis and experiments show that our clone detection is simple, flexible, precise, and scalable, and that it performs better than previous well-known tools.

Categories and Subject Descriptors
D.2.7 [Software Engineering] Maintenance - Restructuring, reverse engineering, and reengineering
I.5.3 [Pattern recognition] Clustering

General Terms
Algorithms, Measurement, Performance, Design, Experimentation, Languages, Verification.

Keywords
Clone detection, software maintenance, token-based clone detection, suffix arrays

1. INTRODUCTION
Code clones, or simply clones, are code fragments of considerable length and significant similarity. Cloning is a common phenomenon found in almost all kinds of software systems. Several studies suggest that as much as 20-30% of large software systems consist of cloned code [2][25]. The presence of clones may lead to maintenance-related problems by increasing the risk of update anomalies. Detection of clones provides several benefits in terms of maintenance, program understanding, reengineering and reuse [21].

Several tools and techniques have been proposed for the detection of clones [2][16][5][10][15][18][19][22][20]. These approaches differ in their code representation, their clone matching techniques, and the granularity of the detected clones.

Token-based code representation provides a suitable abstraction for clone detection. It is easy to adapt to different languages while retaining awareness and control of the underlying language tokens. Comparative studies [6] involving different clone detection techniques have shown that token-based clone detection tools perform well in terms of precision and recall of the detected clones. However, manipulating all the tokens of a large software system is computationally very expensive. Efficient data structures and matching algorithms can mitigate this problem and make the technique scalable even to very large systems of several million lines of code.
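To make the idea of matching over a token sequence concrete, the following minimal Python sketch tokenizes a small code fragment, builds a suffix array over the tokens, and reports pairs of positions that begin the same run of tokens. It is an illustration only and not the RTF implementation: the function names, the regular expression used for tokenization, and the `min_len` threshold are all assumptions introduced here, and the suffix array is built by naive sorting rather than by the linear-time algorithm that RTF incorporates.

```python
import re


def tokenize(source):
    """Split source text into identifiers, integer literals, and single
    punctuation characters (a deliberately simple, hypothetical tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def suffix_array(tokens):
    """Naive construction: sort suffix start positions by the token
    sequence each suffix begins with. Not linear time."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def lcp_array(tokens, sa):
    """lcp[k] is the number of leading tokens shared by the suffixes
    starting at sa[k - 1] and sa[k]; lcp[0] is 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while (a + n < len(tokens) and b + n < len(tokens)
               and tokens[a + n] == tokens[b + n]):
            n += 1
        lcp[k] = n
    return lcp


def repeated_runs(tokens, min_len):
    """Return (start_a, start_b, length) for adjacent suffix-array entries
    that share at least min_len tokens. Overlapping and non-maximal
    matches are not filtered out in this sketch."""
    sa = suffix_array(tokens)
    lcp = lcp_array(tokens, sa)
    return [(sa[k - 1], sa[k], lcp[k])
            for k in range(1, len(sa)) if lcp[k] >= min_len]


if __name__ == "__main__":
    tokens = tokenize("x = a + b ; y = a + b ;")
    for start_a, start_b, length in repeated_runs(tokens, 5):
        print(start_a, start_b, tokens[start_a:start_a + length])
```

Because suffixes that share a prefix sit next to each other in sorted order, repeated token runs can be read off from neighbouring suffix-array entries without comparing every pair of positions, which is what makes this data structure attractive for large token sequences.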
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Conference’04, Month 1–2, 2004, City, State, Country. Copyright 2004 ACM 1-58113-000-0/00/0004…$5.00.