Impact of Programming Language Fragmentation on Developer Productivity: A SourceForge Empirical Study

Jonathan L. Krein, Alexander C. MacLean, Charles D. Knutson
SEQuOIA Lab, Department of Computer Science, Brigham Young University
jonathankrein@byu.net, amaclean@byu.net, knutson@cs.byu.edu

Daniel P. Delorey
Google, Inc.
delorey@google.com

Dennis L. Eggett
Department of Statistics, Brigham Young University
theegg@stat.byu.edu

ABSTRACT

Programmers often develop software in multiple languages. In an effort to study the effects of programming language fragmentation on productivity, and ultimately on a developer’s problem-solving abilities, we present a metric, language entropy, for characterizing the distribution of a developer’s programming efforts across multiple programming languages. We then present an observational study examining the project contributions of a random sample of 500 SourceForge developers. Using a random coefficients model, we find a statistically (alpha level of 0.001) and practically significant correlation between language entropy and the size of monthly project contributions. Our results indicate that programming language fragmentation is negatively related to the total amount of code contributed by developers within SourceForge, an open source software (OSS) community.

1. INTRODUCTION

The ultimate deliverable for a software project is a source code artifact that enables computers to meet human needs. The process of software development, therefore, involves both problem solving and the communication of solutions to a computer in the form of software. We believe that the programming languages with which developers communicate solutions to computers may in fact play a role in the complex processes by which those developers generate their solutions. Baldo et al.
define language as a “rule-based, symbolic representation system” that “allows us to not simply represent concepts, but more importantly for problem solving, facilitates our ability to manipulate those concepts and generate novel solutions” [2]. Although their study focused on the relationship between natural language and problem solving, their concept of language is highly representative of languages used in programming activities. Other research in the area of linguistics examines the differences between mono-, bi-, and multilingual speakers. One particular study, focusing on the differences between mono- and bilingual children, found specific differences in the subjects’ abilities to solve problems [3]. These linguistic studies prompt us to ask questions about the effect that working concurrently in multiple programming languages (a phenomenon we refer to as language fragmentation) has on the problem-solving abilities of developers.

Manuscript of this paper submitted for publication to the International Journal of Open Source Software & Processes (IJOSSP), December 4, 2009.

In an effort to increase both the quality of software applications and the efficiency with which applications can be written, developers often incorporate multiple programming languages into software projects. Each language is selected to meet specific project needs, to which it is specialized; for instance, in a web application a developer might select SQL for database communication, PHP for server-side processing, JavaScript for client-side processing, and HTML/CSS for the user interface. Although language specialization arguably introduces benefits, the total impact of the resulting language fragmentation on developer performance is unclear. For instance, developers may solve problems more efficiently when they have multiple language paradigms at their disposal.
However, the overhead of maintaining efficiency in more than one language may also outweigh those benefits.

Further, development directors and programming team managers must make resource allocation, staff training, and technology acquisition decisions on a daily basis. Understanding the impact of language fragmentation on developer performance would enable software companies to make better-informed decisions regarding which programming languages to incorporate into a project, as well as regarding the division of developers and testers across those languages.

To begin understanding these issues, this paper explores the relationship between language fragmentation and developer productivity. In Sections 2 and 3 we define and justify the metrics used in the paper. We first discuss our selection of a productivity metric, after which we describe an entropy-based metric for characterizing the distribution of a developer’s efforts across multiple programming languages. Having defined the key terms, Section 4 presents the thesis of the paper, and Sections 5 and 6 describe, justify, and validate the data and analysis techniques. We then present in Section 7 the results of an observational study of SourceForge, an open source software (OSS) community, in which we demonstrate a significant relationship between language