Analysis and Clustering of Model Clones: An Automotive Industrial Experience

Manar H. Alalfi, James R. Cordy, Thomas R. Dean
School of Computing, Queen’s University, Kingston, Canada
Email: {alalfi, cordy, dean}@cs.queensu.ca

Abstract: In this paper we present our early experience analyzing subsystem similarity in industrial automotive models. We apply our model clone detection tool, SIMONE, to identify identical and near-miss Simulink subsystem clones and cluster them into classes based on clone size and similarity threshold. We then analyze the clone detection results using graph visualizations generated by SIMGraph, a SIMONE extension, to identify subsystem patterns. SIMGraph provides us and our industrial partners with new and useful insights that improve our understanding of the analyzed models and suggest better ways to maintain them.

I. INTRODUCTION

In today's automotive industry, models are widely used to generate production software code. A modern automobile may have 100 million lines or more of production software source code on board, and up to 80% of the code deployed on up to 100 embedded control units can be generated from models specified using domain-specific formalisms such as Matlab/Simulink [1].

The size and complexity of software in the embedded domain, especially in automotive software systems, is expanding rapidly, while innovation cycles are getting shorter under high cost pressure and large numbers of product-line variants. Consequently, software development in this domain has adopted a highly reuse-oriented approach, in which general-purpose, domain-specific libraries with elements such as PID controllers are reused in the construction of many new software components. Dealing with this complexity and the high frequency of software reuse requires sophisticated software tools to manage the massive amounts of information used by engineers in software development projects.
This rapid growth has led to challenges that are already well known from classic programming languages. In particular, the presence of copied program elements, or “clones”, which can affect productivity and software maintenance, also occurs in models. Thus the identification of common or similar elements in different parts of the software is important to the model-based development process. To address this need, we have developed a method and toolset called SIMONE [2] that uses clone detection for the analysis and formalization of subsystem similarity in industrial models.

In this paper we present our first experience in the analysis and extraction of subsystem patterns in a set of production Simulink automotive models, using visualization techniques to provide insight. Our method is intended to assist reuse in model development in a number of ways: standards and consistency analysis or enforcement in model maintenance; failure and change propagation in model maintenance; and verification and test optimization in model testing. We envisage this work as a fundamental enabler for the future of higher-level modeling, model transformations and metamodeling frameworks, customized to specific domains.

II. APPROACH

This paper describes our first experience in applying our analysis method in an empirical study of a set of actual production models from our industrial partners at General Motors, using pattern mining and clone detection technologies to discover a catalog of repeated subsystem patterns. We have automated much of this first step using subsystem clone classes from our tool SIMONE, a near-miss clone detector for Simulink models [2], from which we derive a first approximation of the pattern set. Our plan is to organize these discovered patterns into a taxonomy with the goal of covering all of the subsystem patterns in the models.
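To make the idea of near-miss clone classes concrete, the following is a minimal sketch, not SIMONE's actual implementation. It assumes each subsystem is available as a plain-text listing of its elements, sorts the lines so that element order does not affect the comparison, and uses Python's `difflib.SequenceMatcher` ratio as a stand-in similarity measure. The function names (`normalize`, `similarity`, `clone_classes`), the default threshold, and the union-find grouping are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(subsystem_text: str) -> list[str]:
    """Strip whitespace, drop blank lines, and sort the remaining lines
    so that the order of elements in the text does not matter."""
    lines = [line.strip() for line in subsystem_text.splitlines()]
    return sorted(line for line in lines if line)


def similarity(a: list[str], b: list[str]) -> float:
    """Line-level similarity in [0, 1]; a stand-in measure (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def clone_classes(subsystems: dict[str, str], threshold: float = 0.7) -> list[set[str]]:
    """Group subsystems whose pairwise similarity meets the threshold.

    Pairs at or above the threshold are merged transitively with a small
    union-find; groups with a single member are not reported.
    """
    normalized = {name: normalize(text) for name, text in subsystems.items()}
    parent = {name: name for name in subsystems}

    def find(x: str) -> str:
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(subsystems, 2):
        if similarity(normalized[a], normalized[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in subsystems:
        groups.setdefault(find(name), set()).add(name)
    return [group for group in groups.values() if len(group) > 1]
```

Raising the threshold toward 1.0 restricts the classes to identical subsystems (after sorting), while lowering it admits near-miss variants. Because merging is transitive, two members of a class may be less similar to each other than the threshold; a stricter grouping would require every pair in a class to meet it.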
In this paper we introduce SIMGraph, a graph visualization extension for SIMONE results, which helps us to visualize and understand Simulink subsystem clones and patterns in a more intuitive way. In the following sections we discuss in more detail the phases of our analysis and our case study analyzing a set of industrial automotive Simulink models.

978-1-4799-3752-3/14 © 2014 IEEE. CSMR-WCRE 2014, Antwerp, Belgium, Industry Track. Accepted for publication by IEEE. © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

III. CLONE IDENTIFICATION

Our analysis consists of three phases. In the first, the “discovery” phase, our primary goal is the discovery and identification of common subsystem patterns in an example production model set obtained from our industrial partners at General Motors.

For that purpose, we use our previously developed method for leveraging text-based code clone detectors to find near-miss clones in graphical models, implemented in SIMONE [2], a tool for Simulink models based on the NICAD clone detector [3]. In that work, we outlined the challenges of using a parser- and text-based method on graphical models, described our solutions using filtering and sorting of the textual representation, and compared our results to a state-of-the-art graph-based method, showing that our near-miss detection can find meaningful subsystem clones that graph-based methods can miss. Our approach generalizes to