CodeCompass: An Open Software Comprehension Framework for Industrial Usage

Zoltán Porkoláb, Tibor Brunner
Eötvös Loránd University
Budapest, Hungary
[gsd,bruntib]@caesar.elte.hu

Dániel Krupp, Márton Csordás
Ericsson Hungary Ltd.
Budapest, Hungary
[daniel.krupp,marton.csordas]@ericsson.com

ABSTRACT
CodeCompass is an open source LLVM/Clang-based tool developed by Ericsson Ltd. and Eötvös Loránd University, Budapest to help the understanding of large legacy software systems. Based on the LLVM/Clang compiler infrastructure, CodeCompass gives exact information on complex C/C++ language elements like overloading, inheritance, the usage of variables and types, and possible uses of function pointers and virtual functions: features that various existing tools support only partially. Steensgaard’s and Andersen’s pointer analysis algorithms are used to compute and visualize the use of pointers and references. The wide range of interactive visualizations extends further than the usual class and function call diagrams; architectural, component and interface diagrams are a few of the implemented graphs. To make comprehension more extensive, CodeCompass also utilizes build information to explore the system architecture, as well as version control information.

CodeCompass is regularly used by hundreds of designers and developers. Having a web-based, pluggable, extensible architecture, the CodeCompass framework can be an open platform to further code comprehension, static analysis and software metrics efforts. The source code and a tutorial are publicly available on GitHub, and a live demo is also available online.

KEYWORDS
code comprehension, C/C++ programming language, software visualization

ACM Reference Format:
Zoltán Porkoláb, Tibor Brunner, Dániel Krupp, and Márton Csordás. 2018. CodeCompass: An Open Software Comprehension Framework for Industrial Usage. In ICPC ’18: 26th IEEE/ACM International Conference on Program Comprehension, May 27–28, 2018, Gothenburg, Sweden.
ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.1145/3196321.3197546

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICPC ’18, May 27–28, 2018, Gothenburg, Sweden
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5714-2/18/05...$15.00

1 INTRODUCTION
The maintenance of large, long-existing legacy systems is troublesome. During the extended lifetime of a system the code quality continuously erodes, the original intentions are lost due to turnover among the developers, and the documentation becomes unreliable. Especially in the telecom industry, high-reliability software products, such as IMS signaling servers [1], have typically been in use for 20–30 years [2, 3]. This development landscape has the following peculiar characteristics: i) the software needs to comply with large, complex and evolving standards; ii) it has a multiple-decade-long development and maintenance life-cycle; iii) it is developed in a large (100+ people) development organization; iv) that organization is distributed across multiple countries; and v) development responsibility is occasionally transferred from one site to another. However, this software development landscape is not unique to the telecom industry, and our observations apply to other industries, such as finance, IT platforms, or large-scale internet applications: all areas where complex software is developed and maintained for a long time.
It is well known that in such a design environment, development and maintenance become more and more expensive. Prior to any maintenance activity (new feature development, bug fixing, etc.), programmers first have to locate the place where the change applies, understand the actual code to see what should be extended or modified, and explore the connections to other parts of the software to decide how to interact with them without causing regressions. All these activities require an adequate understanding of the code in question and its environment. Although ideally the person carrying out the activity has full knowledge of the system, in practice this is rarely the case. In fact, programmers often have only a vague understanding of the program they are going to modify. A major cost factor of legacy systems is the extra effort of comprehension. Fixing new bugs introduced due to incomplete knowledge of the system is also very expensive, both in terms of development cost and time.

As the documentation is unreliable, and the original design intentions are lost over the years and due to turnover among the developers, the only reliable source for comprehension is the existing code base.

Development tools do not perform well in the code comprehension process, as they are optimized for writing new code, not for effectively browsing existing code. When creating new code, the programmer spends long periods working at the same abstraction level: e.g., defining class interfaces, and later implementing these classes with relationships to other classes. To understand existing code, it is necessary to jump between abstraction levels frequently: e.g., starting from a method call into a different class, we have to understand the role of that class with its complete interface and where and how that class is used, and then we must drill down into the implementation details of another specific method.
Accordingly, when writing new code, a few files are open in parallel