Learning from Examples to Find Fully Qualified Names of API Elements in Code Snippets

C M Khaled Saifullah, Muhammad Asaduzzaman†, Chanchal K. Roy
Department of Computer Science, University of Saskatchewan, Canada
†School of Computing, Queen's University, Canada
{khaled.saifullah, chanchal.roy}@usask.ca, †muhammad.asaduzzaman@queensu.ca

Abstract—Developers often reuse code snippets from online forums, such as Stack Overflow, to learn API usages of software frameworks or libraries. These code snippets often contain ambiguous undeclared external references, which make it difficult to learn and use those APIs correctly. In particular, reusing code snippets that contain such references requires significant manual effort and expertise, because manually resolving the fully qualified names (FQNs) of API elements is a non-trivial task. In this paper, we propose a novel context-sensitive technique, called COSTER, to resolve FQNs of API elements in such code snippets. The proposed technique collects locally specific source code elements as well as globally related tokens as the context of FQNs, calculates likelihood scores, and builds an occurrence likelihood dictionary (OLD). Given an API element as a query, COSTER captures the context of the query API element, matches it with the FQNs of API elements stored in the OLD, and ranks the matched FQNs using three different scores: likelihood, context similarity, and name similarity. Evaluation with more than 600K code examples collected from GitHub and two different Stack Overflow datasets shows that our proposed technique improves precision by 4-6% and recall by 3-22% compared to state-of-the-art techniques. The proposed technique also significantly reduces the training time compared to StatType, a state-of-the-art technique, without sacrificing accuracy. Extensive analyses of the results demonstrate the robustness of the proposed technique.

Index Terms—API usages, Context sensitive technique, Recommendation system, Fully Qualified Name

I. INTRODUCTION

Developers extensively reuse Application Programming Interfaces (APIs) of software frameworks and libraries to save both development time and effort. This requires learning new APIs during software development. However, inadequate and outdated API documentation hinders the learning process [1], [2]. As a result, developers favor code examples over documentation [3]. To understand APIs through code examples, developers explore online forums, such as Stack Overflow (SO, https://stackoverflow.com), GitHub Issues, and GitHub Gists (https://gist.github.com/discover). These online forums provide a good amount of resources regarding API usages [4]. However, such usage examples can suffer from external reference and declaration ambiguities when one attempts to compile them [2], [5]. External reference ambiguity occurs due to missing external references, whereas declaration ambiguity is caused by missing declaration statements. As a result of these ambiguities, code snippets from online forums are difficult to compile and run. According to Horton and Parnin [6], only 1% of the Java and C# code examples included in Stack Overflow posts are compilable. Yang et al. [7] also report that less than 25% of Python code snippets in GitHub Gist are runnable.

Resolving FQNs of API elements can help to identify missing external references or declaration statements. Prior studies link API elements in forum discussions to their documentation using Partial Program Analysis (PPA) [8], text analysis [2], [9], and iterative deductive analysis [5]. All of these techniques, except Baker [5], need adequate documentation or discussion in online forums. However, 47% of the APIs do not have any documentation [10], and such APIs cannot be resolved by those techniques. Baker [5] depends on scope rules and relationship analysis to deduce FQNs of API elements.
However, the technique fails to leverage the code context and cannot infer 15-31% of code snippets due to inadequate information within the scope [11]. Recently, Statistical Machine Translation (SMT) has been used to determine FQNs of APIs in StatType [11]. However, that technique requires a large number of code examples for training, and it performs poorly for APIs that have fewer examples. The training time of StatType is also considerably higher than that of other techniques.

In this paper, we propose a context-sensitive type solver, called COSTER. The proposed technique collects locally specific source code elements as well as globally related tokens as the context of FQNs of API elements. We calculate the likelihood of the context tokens appearing together with the FQN of each API element. The collected usage contexts and likelihood scores are indexed by the FQNs of API elements in the occurrence likelihood dictionary (OLD). Given an API element as a query, COSTER first collects the context of the query API element. It then matches the query context with that of the FQNs of API elements stored in the OLD, and ranks those FQNs using three different scores: likelihood, context similarity, and name similarity.

We compare COSTER against two state-of-the-art techniques for resolving FQNs of API elements, Baker [5] and StatType [11], using more than 600K code snippets from GitHub [12] and two different Stack Overflow (SO) datasets. We not only reuse the SO dataset prepared by Phan et al. [11] but also build another dataset of 500 SO posts. Results from our evaluation show that COSTER improves precision by 4-6% and recall by 3-22% compared to state-of-the-art