Understanding the Effects of Code Clones on Modularity in Software Systems
Liguo Yu, Computer Science and Informatics, Indiana University South Bend, South Bend, IN, USA, ligyu@iusb.edu
S. Ramaswamy, A. Vaidyanathan, Industrial Software Systems, ABB Corporate Research Center, Bangalore, India, srini@ieee.org
Abstract: Modularity is an important software design principle. One key point in the design of high-quality software products is avoiding code clones, i.e., portions of source code that are identical or similar to another. During the software evolution process, new code segments are frequently added, and it is common to see code clones accrue with the release of new versions of the product. Such accumulation could be more serious in systems software, where new code segments are associated with similar functions or similar drivers. In this paper, we study code clones in two versions of the Linux kernel from the viewpoint of modularity. Our investigation finds that although considerable effort has been spent in Linux to remove some code clones, more code clones are typically added with the release of new versions. This has become a major issue that is potentially degrading the modularity of the Linux kernel.
Keywords: code clone, modularity, system software, Linux
I. INTRODUCTION
Modularity is a measure of the extent to which a software system is composed of disparate, substitutable modules, each of which accomplishes a single function. In principle, modularity can be classified into functional modularity and architectural modularity. Functional modularity separates different functions from each other in different parts of the source code; it enables high cohesion and low coupling by breaking the source code into independent, non-blocking modules. Architectural modularity recommends dividing the software system into separate layers, with separate concerns for each layer.
In software systems design, modularity is a widely accepted design principle [1] [2]. It has been considered, together with hierarchical structure and interaction locality, as one of the three basic rules governing the evolution of complex systems [3] [4]. To achieve high understandability, maintainability, and reusability, an entire software system should be decomposable into manageable components, such as libraries, functions, classes, and aspects. One objective of these different abstractions is to reduce source code redundancy. For example, a library function is written once and consumed by client components, and an abstract class is implemented by them.
When a software system is initially designed and developed, the underlying architectural concepts might advocate the principle of modularity. However, as the system evolves to accommodate new and emergent functional and nonfunctional requirements, the modularity principle might be compromised, i.e., the existing modular design might be altered or new non-modular code might be generated. For example, to evolve an improvised protocol handler, or to add a new feature to an existing driver, a programmer might inadvertently add new function code that is the same as or similar to other functions, thereby creating redundant functions for that particular code segment. While adopting, say, the strategy design pattern could make the source code extensible without the need to add redundant functionality, such practices are often adopted only by seasoned software developers. Ensuring a maintainable codebase through modular design is often a casualty of the pressure to get to market. Although the code may be thoroughly tested and correct from the implementation viewpoint, such duplication and redundancy of functions or classes is not good practice from the design viewpoint. Such duplication or redundancy of code segments in a program is, in general, called a code clone.
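The contrast between cloned handlers and a strategy-based design can be illustrated with a minimal sketch. Everything in it is hypothetical and is not taken from the Linux kernel or from the study itself: the handler names, the `PacketHandler` class, and the `STRATEGIES` table are assumptions made purely for illustration. The first two functions are clones that differ in a single step; the refactored version keeps the shared steps in one place and passes the varying step in as a strategy.

```python
from typing import Callable, Dict

# Before: two hypothetical handlers that are near-identical clones.
# Only the final formatting step differs between them.
def handle_tcp_clone(payload: bytes) -> str:
    if not payload:
        raise ValueError("empty payload")
    body = payload.decode("ascii").strip()
    return "TCP:" + body.upper()


def handle_udp_clone(payload: bytes) -> str:
    if not payload:
        raise ValueError("empty payload")
    body = payload.decode("ascii").strip()
    return "UDP:" + body.lower()


# After: the shared validation and decoding live in one place, and the
# step that varies is supplied as a strategy (a plain callable here).
Formatter = Callable[[str], str]


class PacketHandler:
    def __init__(self, formatter: Formatter) -> None:
        self._formatter = formatter

    def handle(self, payload: bytes) -> str:
        if not payload:
            raise ValueError("empty payload")
        body = payload.decode("ascii").strip()
        return self._formatter(body)


# Hypothetical registry: supporting a new protocol means adding one entry,
# not copying and editing an existing handler.
STRATEGIES: Dict[str, Formatter] = {
    "tcp": lambda body: "TCP:" + body.upper(),
    "udp": lambda body: "UDP:" + body.lower(),
}


def make_handler(protocol: str) -> PacketHandler:
    return PacketHandler(STRATEGIES[protocol])
```

With this arrangement, a fix to the validation or decoding logic is made once in `PacketHandler.handle`, whereas in the cloned version the same fix would have to be repeated in every copy.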
Formally, a code clone is defined as a portion of code in a source file that is identical or similar to another [5]. As described above, code clones are often introduced during software evolution, when an original portion of code is copied and pasted within the same file or into different files [6] [7]. Due to the pressure of reaching the market, such development practices are still rampant in software development organizations across the world. Code clones defy the design principle of modularity and make it difficult to maintain and test programs effectively. If a cloned function has a bug, the bug is propagated throughout the system with every copy. The issue becomes more serious if the function has mutated through the evolution process: even if we later detect the bug in the original function, it is hard to locate and fix all the mutated copies. The reverse case is similar: when a bug is found in a mutated version and traced back to the original code, it can still be difficult to track down all the other mutated copies. Furthermore, code clones can result in unnecessary ‘bulging’ of the source code, which requires additional time to build, more memory to run, more effort to maintain, and significant cost to upgrade the system. Therefore, as software programs evolve over time, they might suffer from a degradation of modularity due to the accumulation of code clones [8]. Nevertheless, code clones are common in evolving software systems, especially in systems software. Because functions provided by system software generally follow similar algorithms and have similar solutions, copy and paste might be considered as a normal development routine in implementing
2012 19th Asia-Pacific Software Engineering Conference 1530-1362/12 $26.00 © 2012 IEEE DOI 10.1109/APSEC.2012.49 105