The cost of the build tax in scientific software

Lorin Hochstein
USC/ISI, Arlington, VA
lorin@isi.edu

Yang Jiao
Virginia Tech, Blacksburg, VA
jiaoyang@vt.edu

Abstract: All compiled software systems require a build system: a set of scripts to invoke compilers and linkers to generate the final executable binaries. For scientific software, these build scripts can become extremely complex. Anecdotes suggest that scientific programmers have long been dissatisfied with the current software build toolchains. In this paper, we describe preliminary results from a case study of two projects to estimate the fraction of effort devoted to maintaining these scripts, which we refer to as the ‘build tax’. While estimates based on line counts are on the order of only 5%, estimates based on activity-related metrics suggest much higher values.

Keywords: scientific computing; makefiles; case studies; software repositories

I. Introduction

When a developer sets out to write a program in a compiled language, she will invariably write two: the main program itself, and a secondary one that invokes compilers and linkers to transform the main program from source code to executable binary. These secondary programs are commonly implemented as makefiles, a script-based technology which dates back to the mid-seventies [3].

As a community, we software engineering researchers have not paid much attention to the development of build scripts. However, for scientific software, writing and maintaining these build scripts can be a substantial headache. In this paper, we refer to this additional effort overhead to maintain build scripts as the build tax.

In order to estimate the magnitude of the build tax, we performed a case study of two computational science projects.
By focusing on two projects, we hope to gain insights into the software development process of computational scientists and to provide initial estimates of the impact of build effort. These estimates can serve as a starting point for future studies and motivate the development of better build tools.

The projects we selected for our case studies have much in common: both incorporate simulations of thermonuclear reactions, are written mostly in Fortran, and have access to unclassified supercomputers located at U.S. Department of Energy (DOE) facilities. The first is the FACETS project¹, a distributed computational science software project led by Tech-X Corporation. The second is the Flash Center², a collocated computational science software project based at the University of Chicago.

¹ http://www.facetsproject.org

A. Background: FACETS and FLASH projects

The FACETS (Framework Application for Core-Edge Transport Simulations) project was started in early 2007 with the goal of providing a framework for simulating plasma confinement for the fusion energy platform. The project is supported by funding from the U.S. Department of Energy (DOE). The project team is distributed across eleven organizations, including commercial companies, U.S. government laboratories, and university labs. Because the project is spread across multiple organizations, the physics components are developed in different institutional cultures, where each culture represents unique challenges and code development practices that impact software engineering issues.

The Flash Center was started at the University of Chicago around 1997 with the goal of studying thermonuclear flashes, events of rapid or explosive thermonuclear burning that occur on the surfaces and in the interiors of compact stars. To conduct this research, the Flash Center developed a simulation code called FLASH [1].
The FLASH code simulates the thermonuclear explosions of stars, and can be used for various other astrophysical, cosmological and computational fluid dynamics simulations. The FLASH code is used both by scientists affiliated with the Flash Center and by external users, who can obtain access to the source code at no cost if certain conditions are met.

² http://flash.uchicago.edu

B. Prior beliefs

Our prior beliefs about build effort going into this project were based on the prior work of Kumfert and Epperly [5]. They ran a survey of computational scientists at DOE labs and universities and found reported overheads of about 12%, with individual cases reaching the 20–30% range. We believed that the FACETS project would lie at the higher end of this scale because it has a large number of external libraries as dependencies and