Tracing Object-Oriented Code into Functional Requirements

G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, E. Merlo

antoniol@ieee.org
{gerardo.canfora,gec,delucia}@unisannio.it
ettore.merlo@polymtl.ca

University of Sannio, Faculty of Engineering - Piazza Roma, I-82100 Benevento, Italy
University of Naples "Federico II", DIS - Via Claudio 21, I-80125 Naples, Italy
Dep. Electrical and Computer Engineering, Ecole Polytechnique, C.P. 6079, Succ. Centre Ville, Montreal, Quebec, Canada

Abstract

Software system documentation is almost always expressed informally, in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs and related maintenance reports.

We propose an approach to establish and maintain traceability links between source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high level concepts with program concepts, and vice-versa.

In this paper, the approach is applied to software written in an object-oriented language, namely Java, to trace classes to functional requirements.

Keywords: redocumentation, traceability, program comprehension, object orientation

1. Introduction

The research reported in this paper addresses the problem of establishing traceability links between the free text documentation associated with the development and maintenance cycle of a software system and its source code. These links help program comprehension in several ways.
Existing cognition models share the idea that program comprehension can occur in a bottom-up manner [19] [18], a top-down manner [4] [23], or some combination of the two [11] [13] [14] [15]. They also agree that programmers use different types of knowledge during program comprehension, ranging from domain specific knowledge to general programming knowledge [4] [22] [25].

Traceability links between areas of code and related sections of free text documents, such as an application domain handbook, a specification document, a set of design documents, or manual pages, aid both top-down and bottom-up comprehension. In top-down comprehension, once a hypothesis has been formulated, the traceability links provide hints on where to look for beacons that either confirm or refute it. In bottom-up comprehension, the main role of the traceability links is to assist programmers in assigning a concept to a chunk of code and in aggregating chunks into hierarchies of concepts. Traceability links between code and other sources of information are also valuable when performing the combined analysis of heterogeneous information and, ultimately, when constructing a mental model of the software under consideration.

At WCRE'99 [1] we presented a method to establish and maintain traceability links between code and free text documents. A premise of the method is that developers use meaningful names for program items, in particular classes, variables, methods, and exchange parameters. The underlying assumption is that the application-domain knowledge programmers process when writing the code is captured by mnemonics for identifiers. The method exploits probabilistic information retrieval techniques to estimate a language model for each document or document section, and applies Bayesian classification to score the sequence of mnemonics extracted from a selected area of code against the language models.
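The scoring idea described above can be illustrated with a minimal sketch. It is not the estimator used in [1]: it assumes a unigram language model with add-one smoothing, a uniform prior over documents (so the Bayesian posterior ranks documents by likelihood alone), and no stemming or stop-word removal. The function names, the example requirement texts, and the identifiers are hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def tokenize(text):
    """Lowercase a free text document and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_model(text, vocabulary):
    """Estimate P(word | document) with add-one smoothing (an assumption)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    size = len(vocabulary)
    return {w: (counts[w] + 1) / (total + size) for w in vocabulary}


def score(words, model):
    """Log-likelihood of the word sequence; out-of-vocabulary words are skipped."""
    return sum(math.log(model[w]) for w in words if w in model)


def rank_documents(identifiers, documents):
    """Rank document names by how well their model explains the identifiers."""
    vocabulary = {w for text in documents.values() for w in tokenize(text)}
    models = {name: unigram_model(text, vocabulary)
              for name, text in documents.items()}
    query = [w for ident in identifiers for w in split_identifier(ident)]
    scores = {name: score(query, model) for name, model in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical requirement texts and class identifiers, for illustration only.
documents = {
    "withdraw": "The customer withdraws cash from the account balance",
    "login": "The user enters a password to log in to the system",
}
identifiers = ["AccountManager", "withdrawCash", "newBalance"]
ranking = rank_documents(identifiers, documents)
```

In this toy setting the words `account`, `cash`, and `balance` occur only in the first requirement, so it is ranked first. Because every document shares one vocabulary, skipping out-of-vocabulary words does not change the ranking.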
Higher scores suggest the existence of links between the area of code from which a particular sequence of mnemonics is extracted and the document that generated the language model.

The paper presented the application of the method to establish traceability links between C++ source classes and manual pages and discussed the results of a case study on a C++ class library, namely LEDA (Library of Efficient Data

0-7695-0656-9/00 $10.00 © 2000 IEEE