Exploring PHP Feature Usage for Static Analysis

Mark Hills*, Paul Klint*†, and Jurgen J. Vinju*†
* Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
† INRIA Lille Nord Europe, Lille, France
{Mark.Hills,Paul.Klint,Jurgen.Vinju}@cwi.nl

Abstract—PHP is one of the most popular languages for server-side application development. As with other scripting languages, PHP contains a number of dynamic features that pose challenges to the efficiency and precision of static analysis tools. Because of this, many static analysis tools and techniques applied to other languages have not been applied to PHP. In this paper we study how PHP is used in practice. Our goal is to provide the overview and insights needed to steer the design and implementation of static analysis tools for PHP that can be used to detect security vulnerabilities, find other subtle programming errors, and support PHP code refactoring. We analyze, in general, which features need to be supported to reach reasonable coverage over a representative corpus. More importantly, we focus on language features that pose a challenge to static analysis, and we explore how and when they occur in real programs. Based on this analysis, we recommend several lightweight techniques to mitigate the effect of these features on analysis tools.

I. INTRODUCTION

PHP [1], invented by Rasmus Lerdorf in 1994, is a general-purpose programming language focused on server-side application development. PHP is one of the most popular languages for website development, installed on almost 25 million sites [2] and ranking 6th on the TIOBE programming community index [3] as of August 2012. Starting as an imperative language, PHP now includes a single-inheritance class model and features such as interfaces, namespaces, exceptions, traits, and closures.

Like many other scripting languages, PHP is dynamically typed.
Type correctness is judged based on duck typing, allowing values to be used whenever they can behave like values of the expected type. For instance, the strings "3" and "4", when added together, yield the number 7; the numbers 3 and 4, when concatenated, yield the string "34"; and a call to method m is supported by any object that implements m, assuming the correct number of parameters is provided.

PHP’s flexibility and dynamic nature may sometimes yield unexpected results and make programs hard to understand. The files that are included in another file are computed at runtime, making it difficult to know, before execution, the text of the program that will actually run; variable features provide reflective access to variables, classes, functions, methods, and properties through strings; magic methods allow accesses to either non-existent or non-public methods and properties to be handled on the fly; and the behavior of built-in operations can be puzzling, returning unexpected results (e.g., "hello"+"world" is equal to 0).

The dynamic nature of PHP programs, in conjunction with the lack of types, provides strong motivation for the construction of program analysis tools to aid in program understanding, testing, security vulnerability detection, refactoring, and debugging. Unfortunately, the same features that make it challenging for a human to understand PHP programs also impact the correctness, precision, and efficiency of static program analysis tools.

We are interested in answering the following questions:

Q1 What part of PHP would an analysis tool need to support to cover a significant percentage of real PHP code? See Section V.
Q2 Where, how often, and in what ways are some of the harder-to-analyze language features used in existing PHP code? See Section VI.
Q3 Which lightweight techniques can we identify to mitigate the problems caused by these hard-to-analyze features? See Section VI.
To answer these questions, we have assembled a large corpus of open-source PHP systems, described further in Section III. We then analyzed this corpus using the Rascal [4] meta-programming language (Section IV). Sections V and VI contain our main results, and Section VII presents final thoughts and concludes. As shown in the next section, we are not the first to conduct such an empirical study of programs and of how language features are used.

II. RELATED WORK

1) Observing usage: To optimize programming language design, Knuth [5] proposed instrumenting the FORTRAN compiler to acquire usage statistics for language features, and instrumenting user code to measure specific statement usage in order to generate run-time profiles. While CPU profiling is now a widely accepted method, language feature profiling is not. As Knuth aptly states, observing how a language is used could provide valuable insight into what a compiler should actually provide in terms of syntactic constructs, as well as which idioms could be beneficial to focus on for optimization purposes. We share a similar goal in this paper: we want to know how PHP is used so that we can steer the quality of our code processor, which, in our case, is a static analyzer.

Similar work was performed by Morandat et al. [6], who focused on evaluating the design of the R language over a corpus of 3.9 million lines of R code. They employed a