The cost of the build tax in scientific software

Lorin Hochstein, USC/ISI, Arlington, VA, lorin@isi.edu
Yang Jiao, Virginia Tech, Blacksburg, VA, jiaoyang@vt.edu

Abstract: All compiled software systems require a build system: a set of scripts to invoke compilers and linkers to generate the final executable binaries. For scientific software, these build scripts can become extremely complex. Anecdotes suggest that scientific programmers have long been dissatisfied with the current software build toolchains. In this paper, we describe preliminary results from a case study of two projects to estimate the fraction of effort devoted to maintaining these scripts, which we refer to as the 'build tax'. While estimates based on line counts are on the order of only 5%, estimates based on activity-related metrics suggest much higher values.

Keywords: scientific computing; makefiles; case studies; software repositories

I. Introduction

When a developer sets out to write a program in a compiled language, she will invariably write two: the main program itself, and a secondary one that invokes compilers and linkers to transform the main program from source code to executable binary. These secondary programs are commonly implemented as makefiles, a script-based technology which dates back to the mid-seventies [3].

As a community, we software engineering researchers have not paid much attention to the development of build scripts. However, for scientific software, writing and maintaining these build scripts can be a substantial headache. In this paper, we refer to this additional effort overhead to maintain build scripts as the build tax.

In order to estimate the magnitude of the build tax, we performed a case study of two computational science projects.
By focusing on two projects, we hope to gain insights into the software development process of computational scientists and provide initial estimates of the impact of build effort that will serve as a starting point for future studies, as well as to motivate the development of better build tools.

The projects we selected for our case studies have much in common: both incorporate simulations of thermonuclear reactions, are written mostly in Fortran, and have access to unclassified supercomputers located at U.S. Department of Energy (DOE) facilities. The first is the FACETS project¹, a distributed computational science software project led by Tech-X Corporation. The second is the Flash Center², a collocated computational science software project based at the University of Chicago.

¹ http://www.facetsproject.org

A. Background: FACETS and FLASH projects

The FACETS (Framework Application for Core-Edge Transport Simulations) project was started in early 2007 with the goal of providing a framework for simulating plasma confinement for the fusion energy platform. The project is supported by funding from the U.S. Department of Energy (DOE). The project team is distributed across eleven organizations, including commercial companies, U.S. government laboratories, and university labs. Because the project is spread across multiple organizations, the physics components are developed in different institutional cultures, where each culture presents unique challenges and code development practices that impact software engineering issues.

The Flash Center was started at the University of Chicago around 1997 with the goal of studying thermonuclear flashes, events of rapid or explosive thermonuclear burning that occur on the surfaces and in the interiors of compact stars. To conduct this research, the Flash Center developed a simulation code called FLASH [1].
The FLASH code simulates the thermonuclear explosions of stars, and can be used for various other astrophysical, cosmological and computational fluid dynamics simulations. The FLASH code is used both by scientists affiliated with the Flash Center and by external users, who can obtain access to the source code at no cost if certain conditions are met.

B. Prior beliefs

Our prior beliefs about the build effort going into these projects were based on the prior work of Kumfert and Epperly [5]. They ran a survey of computational scientists at DOE labs and universities and found reported overheads of about 12%, with individual cases in the 20–30% range. We believed that the FACETS project would lie at the higher end of this scale because it has a large number of external libraries as dependencies and

² http://flash.uchicago.edu