An Efficient New Multi-Language Clone Detection
Approach from Large Source Code
Saif Ur Rehman, Kamran Khan
Department of Computer Science
Shaheed Zulifiqar Ali Bhutto Institute of Science and
Technology (SZABIST)
Islamabad, Pakistan
saifi.ur.rehman@gmail.com, Kamran_3388@yahoo.com
Simon Fong, Robert Biuk-Aghai
Department of Computer and Information Science
Faculty of Science and Technology
University of Macau
Macau SAR, China
ccfong@umac.mo, robertb@umac.mo
Abstract— In software engineering, the concept of code reuse is
very common. Code reuse here refers to copying and pasting
code into multiple places in the same software or in different
software without modification. This practice may reduce
software maintainability and give rise to serious maintenance
problems. In the last few decades numerous code clone detection
techniques and tools have been proposed for capturing
duplicated code. Each of these techniques attempts to
find duplicated code, which is also known as a software
clone. These techniques include Kclone, CP-Miner, CC-Finder,
and CReN. Prior research has mainly explored the
various clone detection techniques and tools proposed so far. In
this study, we propose an efficient clone detection technique
that detects clones in various programming languages.
We have endeavored to improve performance and to overcome the
key limitation of detecting clones in only one language. The
proposed technique has been evaluated using a two-dimensional
array, which provides a faster method of storing and
identifying clones in source code. We are also working on
future directions, including the removal of detected clones
from the source code.
Keywords-software engineering, collaborative programming,
code reuse, code clone detection techniques
I. INTRODUCTION
In software development, programmers often use the copy-
paste technique to reuse program code in order to reduce
development time. A software developer, by frequent use of
copy-paste, may use the same code over and over again. The
copy-paste technique reduces programming effort and time.
Therefore programmers often prefer it over writing new code
from scratch. In the literature many methods have been
developed for detecting duplicated code originating from copy-
and-paste in software. For example, [1] and [2] use a copy-paste
detection tool for detecting code clones. The main issue
associated with clones in programs is that copy-paste introduces
bugs when programmers forget to rename identifiers
consistently throughout the pasted code [2].
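As a minimal illustration of this kind of bug, consider the following sketch. It is a hypothetical example and is not taken from [2]; the function `scale_grid` and all variable names are assumptions made for illustration only.

```python
def scale_grid(rows, cols, factor):
    """Scale row sizes and column sizes by the same factor."""
    scaled_rows = []
    for i in range(len(rows)):
        scaled_rows.append(rows[i] * factor)

    # The loop below was pasted from the loop above and edited by hand.
    scaled_cols = []
    for i in range(len(cols)):
        # Bug: 'rows' was not renamed to 'cols' on this line.
        scaled_cols.append(rows[i] * factor)
    return scaled_rows, scaled_cols


result = scale_grid([1, 2], [10, 20], 2)
print(result)  # ([2, 4], [2, 4]); the intended column result was [20, 40]
```

The pasted loop still runs without error, which is why a forgotten identifier of this kind is easy to overlook during review.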
Copy-pasted source code raises many issues as the size of the
code grows, and handling these issues becomes an even greater
challenge. A bug in one module is reproduced in every copy [3].
Because much copy-pasted code is undocumented and there is no
record of where the copies are placed, it is extremely hard to
find and fix such programming bugs. These bugs are the main
source of issues related to the maintenance of existing software,
and removing them is complex and costly. Moreover, understanding
and reusing such code, and adding new functionality to it, is
a challenge for programmers, because cloning reduces the
level of abstraction of the code [3].
Several research studies have already been carried out to
identify duplicated code in software applications [12]. However,
these techniques are limited in the programming languages
they support. In the literature different tests
have been performed on known tools and techniques for clone
detection, but the results reveal that no single approach
produces efficient and optimal output.
In software engineering the topic of clone detection has
received much attention over the past decades. In the literature
several methods for clone detection have been proposed. These
techniques are widely used in the software domain [8]. Existing
clone detection techniques focus on finding similar code in the
source code, known as clones; removing them reduces update
issues and application size. These gains, however, can be
improved by raising the level of clone analysis [4]. Previous
studies showed that higher-level analysis can detect design-level
similarities, which can aid software design in terms of code
optimization and understanding of the design.
Today’s software projects are almost always collaborative
efforts. In the case of open source projects it is possible for
hundreds, or even thousands, of programmers to collaborate on
the development of a piece of software. When these
programmers copy and paste each other’s code, the
problems associated with code cloning are exacerbated, because
those who clone the code are unaware of any bug fixes that
the original developers later apply to the original code. Thus
software bugs persist in the clones long after the original code
has been corrected.
The remainder of this paper is organized as follows. Section
2 presents a literature review of clone detection techniques. In
Section 3 we introduce a new clone detection technique called
LSC Miner. Experimental details are described in Section 4,
2012 IEEE International Conference on Systems, Man, and Cybernetics
October 14-17, 2012, COEX, Seoul, Korea
978-1-4673-1714-6/12/$31.00 ©2012 IEEE