A multivariate classification of open source developers

Enrico di Bella a, Alberto Sillitti b,*, Giancarlo Succi b

a Faculty of Economics, Università degli Studi di Genova, Genova, Italy
b Faculty of Computer Science, Free University of Bolzano, Bolzano, Italy

Article info
Article history:
Received 12 January 2011
Received in revised form 5 June 2012
Accepted 22 September 2012
Available online 5 October 2012

Keywords: Open source; Development process; Empirical studies; Software metrics

Abstract

Open source software development is becoming increasingly relevant. Understanding the behavior of developers in open source software projects and identifying the kinds of contributions they make is an essential step toward improving the efficiency of the development process and organizing development teams more effectively. Moreover, understanding the level of participation of the different developers helps to determine which members of the development team are more important than others and who the actual key developers are. This paper investigates the behavior of open source developers and the structure of the development of open source projects through the analysis of a very large dataset: 10 well-known and widely used open source software projects, for a total of more than 4 MLOC (millions of lines of code) modified, distributed over more than 200 K versions. This study builds on other studies in this area, applying a set of rigorous statistical techniques to analyze how developers contribute to the projects. Its novelty lies in the fine-grained analysis of the developers that have commit rights on the repository of the project they work on, in the automated identification of the key contributors of the project, in the size of the analyzed datasets, and in the statistical techniques used to classify the behavior of the developers in an automated way.
To collect such a large volume of data and to ensure its integrity, a tool that automatically mines open source version control systems was used. The main result of this study is the identification of a recurrent pattern of four kinds of contributors with the same characteristics in all the projects analyzed, even though the projects differ widely in domain, size, language, etc.

© 2012 Elsevier Inc. All rights reserved.

Information Sciences 221 (2013) 72–83
0020-0255/$ - see front matter
http://dx.doi.org/10.1016/j.ins.2012.09.031

* Corresponding author. Address: Faculty of Computer Science, Libera Università di Bolzano, Piazza Domenicani 3, I-39100 Bolzano, Italy. Tel.: +39 0471 016134.
E-mail addresses: Enrico.diBella@economia.unige.it (E. di Bella), Alberto.Sillitti@unibz.it (A. Sillitti), Giancarlo.Succi@unibz.it (G. Succi).

1. Introduction

Open Source Software (OSS) projects are increasingly popular and their business relevance is significant. Consequently, there is a growing interest in such projects both from a user perspective and from a development model perspective [13]. Moreover, since many companies build their business on OSS (e.g., customizations, hardware and software products including OSS components, services, etc.), it is important for them to identify, inside the community, the key people who are the backbone of the OSS projects on which they base their business. This is important for several reasons, including:

The behavior of such key developers may have an impact on the companies' business (e.g., leaving the project, forking the project, being hired by a competitor, etc.).