Impact of Programming Language Fragmentation on Developer Productivity: A SourceForge Empirical Study

Jonathan L. Krein, Alexander C. MacLean, Charles D. Knutson
SEQuOIA Lab, Department of Computer Science, Brigham Young University
jonathankrein@byu.net, amaclean@byu.net, knutson@cs.byu.edu

Daniel P. Delorey
Google, Inc.
delorey@google.com

Dennis L. Eggett
Department of Statistics, Brigham Young University
theegg@stat.byu.edu

ABSTRACT

Programmers often develop software in multiple languages. In an effort to study the effects of programming language fragmentation on productivity (and ultimately on a developer’s problem-solving abilities), we present a metric, language entropy, for characterizing the distribution of a developer’s programming efforts across multiple programming languages. We then present an observational study examining the project contributions of a random sample of 500 SourceForge developers. Using a random coefficients model, we find a statistically (alpha level of 0.001) and practically significant correlation between language entropy and the size of monthly project contributions. Our results indicate that programming language fragmentation is negatively related to the total amount of code contributed by developers within SourceForge, an open source software (OSS) community.

1. INTRODUCTION

The ultimate deliverable for a software project is a source code artifact that enables computers to meet human needs. The process of software development, therefore, involves both problem solving and the communication of solutions to a computer in the form of software. We believe that the programming languages with which developers communicate solutions to computers may in fact play a role in the complex processes by which those developers generate their solutions. Baldo et al.
define language as a “rule-based, symbolic representation system” that “allows us to not simply represent concepts, but more importantly for problem solving, facilitates our ability to manipulate those concepts and generate novel solutions” [2]. Although their study focused on the relationship between natural language and problem solving, their concept of language is highly representative of languages used in programming activities. Other research in the area of linguistics examines the differences between mono-, bi-, and multilingual speakers. One particular study, focusing on the differences between mono- and bilingual children, found specific differences in the subjects’ abilities to solve problems [3]. These linguistic studies prompt us to ask questions about the effect that working concurrently in multiple programming languages (a phenomenon we refer to as language fragmentation) has on the problem-solving abilities of developers.

Manuscript of this paper submitted for publication to the International Journal of Open Source Software & Processes (IJOSSP), December 4, 2009.

In an effort to increase both the quality of software applications and the efficiency with which applications can be written, developers often incorporate multiple programming languages into software projects. Each language is selected to meet specific project needs, to which it is specialized. For instance, in a web application a developer might select SQL for database communication, PHP for server-side processing, JavaScript for client-side processing, and HTML/CSS for the user interface. Although language specialization arguably introduces benefits, the total impact of the resulting language fragmentation on developer performance is unclear. For instance, developers may solve problems more efficiently when they have multiple language paradigms at their disposal. However, the overhead of maintaining efficiency in more than one language may also outweigh those benefits.
Further, development directors and programming team managers must make resource allocation, staff training, and technology acquisition decisions on a daily basis. Understanding the impact of language fragmentation on developer performance would enable software companies to make better-informed decisions regarding which programming languages to incorporate into a project, as well as regarding the division of developers and testers across those languages.

To begin understanding these issues, this paper explores the relationship between language fragmentation and developer productivity. In Sections 2 and 3 we define and justify the metrics used in the paper. We first discuss our selection of a productivity metric, after which we describe an entropy-based metric for characterizing the distribution of a developer’s efforts across multiple programming languages. With the key terms defined, Section 4 presents the thesis of the paper, and Sections 5 and 6 describe, justify, and validate the data and analysis techniques. We then present in Section 7 the results of an observational study of SourceForge, an open source software (OSS) community, in which we demonstrate a significant relationship between language