Testing Software in Age of Data Privacy: A Balancing Act

Kunal Taneja, North Carolina State University, Raleigh, NC 27695, ktaneja@ncsu.edu
Mark Grechanik, Accenture Technology Labs, Chicago, IL 60601, mark.grechanik@accenture.com
Rayid Ghani, Accenture Technology Labs, Chicago, IL 60601, rayid.ghani@accenture.com
Tao Xie, North Carolina State University, Raleigh, NC 27695, xie@csc.ncsu.edu

ABSTRACT

Database-centric applications (DCAs) are common in enterprise computing, and they use nontrivial databases. Testing of DCAs is increasingly outsourced to test centers in order to achieve lower cost and higher quality. When proprietary DCAs are released, their databases should also be made available to test engineers. However, different data privacy laws prevent organizations from sharing this data with test centers because databases contain sensitive information. Currently, testing is performed with anonymized data, which often leads to worse test coverage (such as code coverage) and fewer uncovered faults, thereby reducing the quality of DCAs and obliterating benefits of test outsourcing.

To address this issue, we offer a novel approach that combines program analysis with a new data privacy framework that we design to address constraints of software testing. With our approach, organizations can balance the level of privacy with the needs of testing. We have built a tool for our approach and applied it to nontrivial Java DCAs. Our results show that test coverage can be preserved at a higher level by anonymizing data based on their effect on corresponding DCAs.

Categories and Subject Descriptors

D.2.5 [Software Engineering, Testing and Debugging]: Testing tools; D.4.6 [Software Engineering, Security and Protection]: Information flow controls; K.4.1 [Computers and Society, Public Policy Issues]: Privacy

General Terms

Security, Verification

Keywords

Data anonymity, software testing, privacy framework, utility, PRIEST

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ESEC/FSE'11, September 5–9, 2011, Szeged, Hungary. Copyright 2011 ACM 978-1-4503-0443-6/11/09 ...$10.00.

1. INTRODUCTION

Large organizations today face many challenges when engineering software applications. Particularly challenging is the fact that many applications work with existing databases that contain confidential data. A large organization, such as a bank, insurance company, or government agency, typically hires an external company to develop or test a new custom software application. However, recent data protection laws and regulations [35] around the world prohibit data owners from easily sharing confidential data with external software service providers.

Database-centric applications (DCAs) are common in enterprise computing, and they use nontrivial databases [26]. When releasing these proprietary DCAs to external test centers, it is desirable for DCA owners to make their databases available to test engineers, so that they can perform testing using original data. However, since sensitive information cannot be disclosed to external organizations, testing is often performed with synthetic input data. For instance, if values of the field Nationality are replaced with the generic value "Human," DCAs may execute some paths that result in exceptions or miss certain paths [23]. As a result, test centers report worse test coverage (such as code coverage) and fewer uncovered faults, thereby reducing the quality of applications and obliterating the benefits of test outsourcing [30].
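To make the coverage-loss scenario concrete, the following is a minimal hypothetical sketch (not code from the paper): a code fragment whose branches depend on the Nationality field, and what happens when every value is replaced with the generic token "Human". The class and method names are illustrative assumptions.

```java
// Hypothetical DCA fragment: branch structure depends on Nationality values.
public class NationalityExample {

    // Returns a shipping category; original database values exercise both
    // normal branches, while the anonymized token hits an exception path.
    static String shippingCategory(String nationality) {
        if ("Human".equals(nationality)) {
            // Naively anonymized data drives execution here,
            // a path that original data never reaches.
            throw new IllegalArgumentException("generic value: " + nationality);
        }
        if ("US".equals(nationality)) {
            return "domestic";
        }
        return "international";
    }

    public static void main(String[] args) {
        // Original data covers both normal branches.
        System.out.println(shippingCategory("US"));      // domestic
        System.out.println(shippingCategory("France"));  // international

        // Anonymized data misses both branches and throws instead.
        try {
            shippingCategory("Human");
        } catch (IllegalArgumentException e) {
            System.out.println("anonymized value caused: " + e.getMessage());
        }
    }
}
```

Under this sketch, a test suite run on the anonymized database would report the two value-dependent branches as uncovered and would surface a spurious exception, mirroring the degraded coverage and misleading fault reports described above.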
Automatic approaches for test data generation [12, 17, 22, 29, 37] partially address this problem by generating synthetic input data that lead program execution toward untested statements. However, one of the main issues for these approaches is how to generate synthetic input data with which test engineers can achieve good code coverage. Using original data enables different approaches in testing and privacy to produce higher-quality synthetic input data [3][21, page 42], thus making original data important for test outsourcing.

A fundamental problem in test outsourcing is how to allow a DCA owner to release its private data with guarantees that the entities in this data (e.g., people, organizations) are protected at a certain level while retaining testing efficacy. Ideally, sanitized data (sanitized data or anonymized data is the original data after anonymization; we use the two terms interchangeably throughout this paper) should induce execution paths that are similar to the ones that are induced by the original data. In other words, when data is sanitized, information about how DCAs use this data should be taken into consideration. In practice, this consideration rarely happens; our previous work [23] showed that a popular data anonymization algorithm, called k-anonymity, seriously degrades test coverage of DCAs.

Naturally, different DCAs have different privacy goals and levels of data sensitivity: privacy goals are more relaxed for a DCA that manages a movie ticket database than for a DCA that is used within banks or government security agencies. Applying more relaxed protection to databases is likely to result in greater test coverage since a small part of the databases is anonymized; conversely,