An Ensemble Approach to Detect Code Comment Inconsistencies using Topic Modeling

Fazle Rabbi, Md. Nazmul Haque, Md. Eusha Kadir, Md. Saeed Siddik and Ahmedul Kabir
Institute of Information Technology
University of Dhaka, Bangladesh
Email: {bsse0725, bsse0635, bsse0708, saeed.siddik, and kabir}@iit.du.ac.bd

Abstract—As software systems grow in size, more developers are assigned to each project. To understand the source code, these developers depend heavily on code comments. However, comments and source code are often inconsistent because keeping comments up to date is frequently neglected. Since comments are written in natural language and contain context-related topics from the source code, manual inspection is needed to ensure the quality of a comment with respect to its corresponding code. Existing approaches treat the entire text as features and therefore fail to capture the dominant topics that link a comment to its corresponding code. This paper proposes an approach that automatically extracts dominant topics and identifies the consistency between a code snippet and its corresponding comment. The approach is evaluated on a benchmark dataset containing 2.8K Java code-comment pairs, and the results show that it performs better on several evaluation metrics than the existing state-of-the-art approach, a Support Vector Machine on a vector space model.

Index Terms—Source Code, Code Comment, Topic Modeling, Software Artifact Analysis

I. INTRODUCTION

Code comments, together with their corresponding source code, are the main artifacts of any software system. To support software evolution and maintenance, developers attach comments to code fragments, which give insightful information about the software system.
Comments are important because they are more natural, more descriptive, and easier to understand than source code [1], [2]. In large projects, new developers depend heavily on code comments to understand the corresponding source code. Researchers have found that code and comments evolve over time [3], and the evolved code and comments can become inconsistent with each other. When code changes frequently while its comments stay the same, the comments become invalid or inconsistent with the corresponding source code.

Several diverse approaches have been proposed to track inconsistency between source code and comments. Most of them apply Information Retrieval (IR) techniques to collect lexical information, under the assumption that the textual information of source code and comments is the same. However, that assumption can be violated [4] in several cases; for example, the vocabulary developers use to write source code can differ from the vocabulary of comments (e.g., synonyms). Moreover, the literature on tracking this inconsistency is not sufficiently rich because standard datasets are lacking. A benchmark dataset has been provided in [5], together with a proposal to measure the coherence between source code and comments. In that work, lexical similarity is captured with a Vector Space Model using tf-idf [6], and the code-comment inconsistency is finally measured using a Support Vector Machine (SVM). However, that approach uses the entire vocabulary as features, which can lead to a long execution time.

DOI reference number: 10.18293/SEKE2020-062.

Analysis of the existing literature yields several insights about source code and comments, which are summarized below as research directions in this domain.
• A single word (topic) is more important than a large number of similar words (features).
For example, if a bag of words such as “dropdown”, “chrome”, “menu”, “http”, or “browser” is extracted from a Java method, a topic related to “browser” can represent these words.
• Comments are shorter than source code. So, source code and comments need to be mapped to a common, fixed-size topic representation.
• Developers often choose synonymous words when writing a comment for a piece of source code. So, to capture the semantic information shared between source code and comment, vocabulary information needs to be incorporated.

To capture these insights, the following Research Questions (RQs) are raised to guide the design of an efficient inconsistency detection approach.
• RQ1: How can the underlying meaning of a code and comment pair be comprehended?
• RQ2: How can the relation between the code and comment pair be measured?

We take these research questions as our objectives and address them through the newly proposed code-comment inconsistency detection technique. This paper proposes an automated approach to identify the inconsistency of source code with its respective comments. The contributions of this paper are listed as follows.
• Datasets are pre-processed to capture more meaningful information about source code and comments, e.g., de-