Mining Source Code Descriptions from Developer Communications

Sebastiano Panichella†, Jairo Hernan Aponte Melo‡, Massimiliano Di Penta†, Andrian Marcus⋆, Gerardo Canfora†

† Dept. of Engineering-RCOST, University of Sannio, Italy
‡ Universidad Nacional de Colombia, Bogota, Colombia
⋆ Wayne State University, Detroit, USA

Abstract: Very often, source code lacks comments that adequately describe its behavior. In such situations developers need to infer knowledge from the source code itself, or to search for source code descriptions in external artifacts. We argue that messages exchanged among contributors/developers, in the form of bug reports and emails, are a useful source of information to help in understanding source code. However, such communications are unstructured and usually not explicitly meant to describe specific parts of the source code. Developers searching for code descriptions within communications face the challenge of filtering large amounts of data to extract the pieces of information that are important to them. We propose an approach to automatically extract method descriptions from communications in bug tracking systems and mailing lists. We have evaluated the approach on bug reports and mailing lists from two open source systems (Lucene and Eclipse). The results indicate that mailing lists and bug reports contain relevant descriptions of about 36% of the methods from Lucene and 7% from Eclipse, and that the proposed approach is able to extract such descriptions with a precision of up to 79% for Eclipse and 87% for Lucene. The extracted method descriptions can help developers in understanding the code and could also be used as a starting point for source code re-documentation.

Keywords: Code re-documentation, mining e-mails, program comprehension.

I. INTRODUCTION

Consider the following situation. A developer is reading Java code of an unfamiliar (part of the) system. She encounters a method call.
Ideally, a good method name would indicate its purpose. If not, a nice Javadoc comment would explain what the goal of the method is. Unfortunately, the method name is poorly chosen and there are no comments. This is not an uncommon situation. At this point, the developer has the choice of reading the implementation of the method or searching the external documentation. It is very rare that external documentation is written at method-level granularity (especially when comments are missing) and that such specific information is easy to retrieve.

The goal of our work is to help developers in such situations. Specifically, we aim at providing developers with a means to quickly access descriptions of methods. Our conjecture is that, if other developers had any issues related to a specific method, then a discussion occurred and someone described the method in the context of those issues.

Developers and project contributors communicate with each other through mailing lists and bug tracking systems. They often “instruct” each other about the behavior of a method. This can happen in at least two scenarios. First, when a person (sometimes a newcomer in the project) is trying to solve a problem or implement a new feature, she does not have enough knowledge about the system, and asks for help. Second, when a person explains to others the possible cause of a failure, illustrating the intended (and possibly also the unexpected) behavior of a method. For example, we report a paragraph for issue #1693 posted on the Lucene Jira bug-tracking system¹:

    new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection it walks up in the class hierarchy of the passed in object and finds all interfaces that the class or superclasses implement and that extend the Attribute interface. It then adds the interface-instance mappings to the attribute map for each of the found interfaces.

which clearly describes the behavior of the AttributeSource: addAttributeImpl(AttributeImpl) method.

We claim that unstructured communication between developers can be a precious source of information to help in understanding source code.

So why could developers not simply use text search techniques, based on text or regular expression matching utilities such as grep, to find method descriptions in communication data? Such simple text matching approaches could only identify sentences containing a method name or, more generally, matching a regular expression that combines the method name with other strings such as the class name or some parameter names. They would generate too many false positives. As is the case for requirement-to-code traceability recovery [1], [?], [2], such simple matching is not enough.

This paper presents and validates an approach to automatically mine source code descriptions, in particular method descriptions, from developer communications, such as emails and bug reports². It also presents evidence to support our assumption that developer communications are rich in useful code descriptions.

¹ https://issues.apache.org/jira/browse/LUCENE-1693
² For simplicity, we will only refer to mailing lists/emails, although the approach is applicable to bug tracking systems and other similar communications. Only where it matters will we refer to mailing lists and bug tracking systems separately.
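
To make the limitation of grep-style matching discussed in the Introduction concrete, the following is a minimal sketch of such a naive baseline. It is not part of the proposed approach. The function name `find_mentions`, the sentence-splitting rule, and the email text are all hypothetical and chosen only for illustration.

```python
import re


def find_mentions(method_name, text):
    """Return every sentence of `text` that contains `method_name` as a whole word.

    This mimics a grep-style search: it checks only for the presence of
    the name, not whether the sentence describes what the method does.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(method_name) + r"\b")
    return [s for s in sentences if pattern.search(s)]


# Hypothetical email body (invented for this example).
email = (
    "I get an exception when calling addAttributeImpl from my filter. "
    "addAttributeImpl walks up the class hierarchy and registers every "
    "Attribute interface it finds. "
    "Can we rename addAttributeImpl before the release? "
    "Thanks for the quick reply."
)

hits = find_mentions("addAttributeImpl", email)
for sentence in hits:
    print(sentence)
```

In this invented message, three sentences match the method name, but only the second one actually describes the behavior of the method. The other two matches are the kind of false positives that a pure name-matching search cannot filter out.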