© 2021 IJRAR August 2021, Volume 8, Issue 3 www.ijrar.org (E-ISSN 2348-1269, P- ISSN 2349-5138)
IJRAR21C1604 International Journal of Research and Analytical Reviews (IJRAR) www.ijrar.org 689
AN EXTENSIVE STUDY ON MACHINE
LEARNING METHOD BASED CODE CLONE
DETECTION TECHNIQUES
S. Karthik ¹
¹ Research Scholar, Department of Computer Science, PSG College of Arts and Science (Autonomous), Coimbatore, Tamilnadu, India.
Dr. B. Rajdeepa ²
² Associate Professor and Head, Department of Information Technology, PSG College of Arts and Science (Autonomous), Coimbatore,
Tamilnadu, India.
Abstract: Software developers frequently reuse code fragments by copying and pasting them, with or without slight modifications. As a
consequence, software systems contain very similar sections of code, known as code clones. Code cloning can be harmful to software
evolution and maintenance, and duplicated fragments greatly increase the amount of work required when adapting or improving code.
Various software engineering tasks, including software evaluation analysis, code quality analysis, plagiarism detection, program
understanding, aspect mining, copyright infringement investigation, code compaction, and bug detection, can require the extraction of
code fragments that are semantically or syntactically identical, which makes clone detection an important and valuable software analysis
process. Various clone detection methods have been proposed over the last decade. This article reviews the text, token, tree, Program
Dependency Graph (PDG), and machine learning based clone detection techniques. Their benefits and limitations are analyzed in tabular
form. Based on the analysis, future directions for clone detection are suggested for better software development.
Index Terms - Software development, clone code, clone code detection, duplicated fragments.
1. INTRODUCTION
Software cloning is the practice of copying a segment of code from one location and reusing it in another part of the code, with or
without modifications [1-2]; the copied code is referred to as a clone. Different studies have reported code duplication rates in the
range of 20-59%. The difficulty is that when a bug is found in the original fragment, every copy must be checked for the same flaw. In
addition, copied code increases the work needed to adapt or improve the code. Software engineering activities such as code quality
analysis, plagiarism detection, aspect mining, virus recognition, and bug detection also require syntactically or semantically similar
code to be extracted, which makes clone detection a meaningful part of software analysis. Clones can be created intentionally or
unintentionally. When clones become a hindrance to software maintenance, they can be removed or refactored. Because software cloning
has emerged as an active area of research, clone detection is one of its main concerns [3].
Clone classifications are used in expansion, reengineering, and detection methods. Code clones can be classified as exact or
approximate (based on the form and volume of duplication), contiguous or non-contiguous (based on the contiguity of the matching
program elements), maximal or subsumed (based on the size of the detected clone pair), and so on. Clones are also grouped into exact
(Type 1), renamed or parameterized (Type 2), near-miss (Type 3), and semantic (Type 4) clones. Exact clones are identical to the
original code except for differences in comments and whitespace. Differences in variable names, literals, and keywords are the main
factors that produce Type 2 clones. Near-miss clones are created from the base code by inserting, modifying, or deleting statements. In
semantic clones, the function or behavior of the clone remains the same even though the code or syntax is different.
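The four clone types can be illustrated with a small sketch. The Python snippet below is only an illustration under stated assumptions, not a technique taken from any of the surveyed papers: the sample fragments, the function names `token_stream` and `classify`, and the 0.75 similarity threshold are all hypothetical. It compares token sequences after stripping comments and layout (Type 1), after additionally abstracting identifiers and literals (Type 2), and with a similarity ratio that tolerates inserted statements (Type 3). A Type 4 clone is, by definition, not expected to be caught by this kind of syntactic comparison.

```python
import difflib
import io
import keyword
import tokenize

BASE = """\
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Type 1: same code, only comments and blank lines differ.
TYPE1 = """\
def total(values):
    # running sum
    result = 0

    for v in values:
        result += v   # accumulate
    return result
"""

# Type 2: identifiers renamed, structure unchanged.
TYPE2 = """\
def add_all(items):
    acc = 0
    for x in items:
        acc += x
    return acc
"""

# Type 3: renamed, plus an inserted statement.
TYPE3 = """\
def add_all(items):
    acc = 0
    for x in items:
        if x is None:
            continue
        acc += x
    return acc
"""

# Type 4: same behaviour for numeric input, different syntax.
TYPE4 = """\
def total(values):
    return sum(values)
"""


def token_stream(source, abstract_names=False):
    """Return a normalized token list for a Python fragment."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines are ignored
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            out.append(tokenize.tok_name[tok.type])  # keep structure, drop exact whitespace
        elif (abstract_names and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            out.append("ID")  # every identifier becomes the same placeholder
        elif abstract_names and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")  # every literal becomes the same placeholder
        else:
            out.append(tok.string)
    return out


def classify(a, b, threshold=0.75):
    """Label a pair of fragments; the threshold is an arbitrary assumption."""
    if token_stream(a) == token_stream(b):
        return "Type 1"
    abs_a, abs_b = token_stream(a, True), token_stream(b, True)
    if abs_a == abs_b:
        return "Type 2"
    ratio = difflib.SequenceMatcher(None, abs_a, abs_b).ratio()
    if ratio >= threshold:
        return "Type 3 candidate"
    return "no syntactic match"


if __name__ == "__main__":
    for name, other in [("TYPE1", TYPE1), ("TYPE2", TYPE2),
                        ("TYPE3", TYPE3), ("TYPE4", TYPE4)]:
        print(name, "->", classify(BASE, other))
```

In this sketch the semantic clone falls into the "no syntactic match" bucket, which is the gap that PDG-based and machine learning based techniques aim to close.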
Various strategies for detecting code clones have been developed over time in support of clone management efforts [4-5]. The
techniques are classified according to the data, representation, and algorithms that they employ. Text, token, tree-based, PDG, and
machine learning based clone detection are the most common techniques. The main goal of this article is to gain an understanding of
current research in the field of clone detection and to identify research gaps by examining the merits and demerits of existing
techniques. It will also aid in the selection of appropriate techniques for code clone detection, as the article includes a comparative
study of different techniques based on various parameters. The rest of this article is organized as follows: Section 2 discusses the
most recent techniques for detecting code clones. Section 3 summarizes the survey and addresses its future scope.