An Efficient New Multi-Language Clone Detection Approach from Large Source Code

Saif Ur Rehman, Kamran Khan
Department of Computer Science
Shaheed Zulifiqar Ali Bhutto Institute of Science and Technology (SZABIST)
Islamabad, Pakistan
saifi.ur.rehman@gmail.com, Kamran_3388@yahoo.com

Simon Fong, Robert Biuk-Aghai
Department of Computer and Information Science
Faculty of Science and Technology
University of Macau
Macau SAR, China
ccfong@umac.mo, robertb@umac.mo

Abstract: Code reuse is very common in software engineering. Here it refers to copying and pasting code into multiple places in the same or different software without modification. This practice may reduce software maintainability and give rise to serious maintenance problems. Over the last few decades, numerous code clone detection techniques and tools have been proposed for capturing duplicated, redundant code. Each of these techniques attempts to find duplicated code, also known as a software clone. They include Kclone, CP-Miner, CC-Finder, and CReN, among others. The objective of such research has been to explore the clone detection techniques and tools proposed so far. In this study, we propose an efficient clone detection technique that detects clones in various programming languages. We have endeavored to improve performance and to overcome the key problem of detecting clones in only one language. The proposed technique has been evaluated using a two-dimensional array, which provided a faster way to store and identify clones in source code. We are also working on future directions, including removal of the detected clones from the source code.

Keywords: software engineering, collaborative programming, code reuse, code clone detection techniques

I. INTRODUCTION

In software development, programmers often use the copy-paste technique to reuse program code in order to reduce development time.
Through frequent copy-paste, a developer may reuse the same code over and over again. Because the technique reduces programming effort and time, programmers often prefer it to writing new code from scratch. Many methods have been developed in the literature for detecting duplicated code that originates from copy-and-paste. For example, [1] and [2] use a copy-paste detection tool to detect code clones. The main issue with clones is that copy-paste introduces bugs when the programmer forgets to rename identifiers consistently throughout the pasted code [2].

The issues associated with copy-pasted source code multiply as the code base grows, and handling them becomes an even greater challenge. A bug in one module is reproduced in every copy [3]. Because much copy-pasted code is undocumented and there is no record of where the copies are placed, such bugs are extremely hard to find and fix. They are the main source of maintenance problems in existing software, and removing them is complex and costly. Moreover, understanding and reusing such code is also a challenge for programmers, because duplication lowers the level of abstraction and complicates adding new functionality to the code [3].

Several research studies have already been carried out to identify duplicated code in software applications [12]. However, these techniques are limited in the programming languages they support. Evaluations of known clone detection tools and techniques reported in the literature indicate that no single approach produces efficient and optimal output.

Clone detection has received much attention in software engineering over the past decades. Several methods have been proposed, and these techniques are widely used in the software domain [8].
Existing clone detection techniques focus on finding similar fragments in the source code, known as clones; detecting them helps reduce update problems and application size. These gains, however, can be improved by evaluating the level of clone analysis [4]. Previous studies showed that such analysis can detect design-level similarities, which can aid software design in terms of code optimization and understanding of the design.

Today's software projects are almost always collaborative efforts. In open source projects, hundreds or even thousands of programmers may collaborate on the development of a piece of software. When these programmers copy and paste each other's code, the problems associated with code cloning are exacerbated, because those who clone the code are unaware of any bug fixes that the original developers later apply to it. Software bugs thus persist in the clones long after the original code has been corrected.

2012 IEEE International Conference on Systems, Man, and Cybernetics, October 14-17, 2012, COEX, Seoul, Korea. 978-1-4673-1714-6/12/$31.00 ©2012 IEEE

The remainder of this paper is organized as follows. Section 2 presents a literature review of clone detection techniques. In Section 3 we introduce a new clone detection technique called LSC Miner. Experimental details are described in Section 4,