0098-5589 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2017.2732347, IEEE Transactions on Software Engineering

A Theoretical and Empirical Study of Diversity-aware Mutation Adequacy Criterion

Donghwan Shin, Student Member, IEEE, Shin Yoo, and Doo-Hwan Bae, Member, IEEE

Abstract—Diversity has been widely studied in software testing as guidance towards effective sampling of test inputs in the vast space of possible program behaviors. However, diversity has received relatively little attention in mutation testing. The traditional mutation adequacy criterion is a one-dimensional measure of the total number of killed mutants. We propose a novel, diversity-aware mutation adequacy criterion called the distinguishing mutation adequacy criterion, which is fully satisfied when each of the considered mutants can be identified by the set of tests that kill it, thereby encouraging the inclusion of a more diverse range of tests. This paper presents the formal definition of the distinguishing mutation adequacy criterion and its score. Subsequently, an empirical study investigates the relationship among distinguishing mutation score, fault detection capability, and test suite size. The results show that the distinguishing mutation adequacy criterion detects 1.33 times more unseen faults than the traditional mutation adequacy criterion, at the cost of a 1.56 times increase in test suite size, for adequate test suites that fully satisfy the criteria.
The results show a better picture for inadequate test suites; on average, 8.63 times more unseen faults are detected at the cost of a 3.14 times increase in test suite size.

Index Terms—Mutation testing, test adequacy criteria, diversity

1 INTRODUCTION

One fundamental limitation of software testing is the fact that, to validate the behavior of the Program Under Test (PUT), we can only ever sample a very small number of test inputs out of the vast input space. Almost all existing testing techniques are, at some level, attempts to answer the following question: how does one sample a finite number of test inputs to cover as wide a range of program behaviors as possible?

The concept of diversity has received much attention while answering the above question. For example, Adaptive Random Testing (ART) [1] seeks to increase the diversity of randomly sampled test inputs by choosing an input that is as different from those already sampled as possible. Clustering-based test selection and prioritization [2], [3] assumes that a diverse set of test inputs would explore and validate a wider range of program behaviors. Diversity in test output has been studied as a test adequacy criterion for black box testing of web applications [4]. Information theoretic measures of diversity have also been studied as test selection criteria [5], [6].

In contrast, improving test effectiveness based on diversity has received little attention in mutation testing; the emphasis has instead been on the reduction of mutation testing cost. As classified by Jia and Harman [7], many studies attempt to reduce the cost by mutant sampling [8], [9], selective mutation [10], higher order mutation [11], [12], mutant clustering [13], [14], and mutant subsumption [15], [16], [17]. However, the foundation of mutation testing (i.e., the mutation adequacy criterion) remains essentially the same as it was when first proposed in the 1970s [18].
The mutation adequacy criterion is a testing criterion that estimates the real fault detection capability of a test suite by the simple count of the number of artificially generated faulty programs (i.e., mutants) distinguished (i.e., killed) from the original program.

• D. Shin, S. Yoo, and DH. Bae are with the School of Computing, KAIST, Daejeon, Republic of Korea. E-mail: donghwan@se.kaist.ac.kr, shin.yoo@kaist.ac.kr, bae@se.kaist.ac.kr
Manuscript received 10 Aug. 2016.

Despite the potential correlation between the diversity of mutants and the real fault detection capability, a mutation-adequate test suite does not fully exploit this diversity. Consider a pathological case in which a single test can kill all generated mutants. The traditional mutation adequacy criterion simply deems the single test adequate, although that test does not take the diversity of the mutants into account. If we had a richer mutation adequacy criterion, it would be possible to obtain more powerful mutation-adequate test suites using the same set of mutants. Such a case calls for a richer mutation adequacy criterion.

To tackle this problem, a novel mutation adequacy criterion called the distinguishing mutation adequacy criterion was proposed in our previous paper [19]. At the core of the new criterion lies the idea that mutants can be “distinguished” from each other by the set of tests that kill them. Our mutation adequacy criterion aims not only to kill, but also to distinguish as many mutants as possible, leading to a more diverse set of tests based on the same set of mutants. The empirical results on real faults showed that test suites adequate to the distinguishing mutation adequacy criterion can increase the fault detection rate by up to 76.8 percentage points in comparison to the traditional mutation adequacy criterion [19].
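The contrast between the two criteria can be illustrated with a minimal Python sketch. It takes the simplest reading of the informal description above: a suite is treated as distinguishing-adequate when every mutant is killed and no two mutants are killed by exactly the same subset of the suite. The kill matrix `KILLS`, the mutant and test names, and both function names are hypothetical and chosen only for illustration; this is not the formal definition given later in the paper.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical kill matrix (an assumption for illustration):
# mutant name -> set of tests that kill it.
KILLS: Dict[str, Set[str]] = {
    "m1": {"t1", "t2"},
    "m2": {"t1", "t3"},
    "m3": {"t1"},
}


def kill_signature(mutant: str, suite: Set[str],
                   kills: Dict[str, Set[str]]) -> FrozenSet[str]:
    """Tests in `suite` that kill `mutant`."""
    return frozenset(kills[mutant] & suite)


def is_kill_adequate(suite: Set[str], kills: Dict[str, Set[str]]) -> bool:
    """Traditional criterion: every mutant is killed by at least one test."""
    return all(kill_signature(m, suite, kills) for m in kills)


def is_distinguishing_adequate(suite: Set[str],
                               kills: Dict[str, Set[str]]) -> bool:
    """Simplified reading of the distinguishing criterion (an assumption):
    every mutant is killed AND every mutant has a unique set of killing tests.
    """
    signatures = [kill_signature(m, suite, kills) for m in kills]
    all_killed = all(signatures)  # an empty signature means "not killed"
    all_unique = len(set(signatures)) == len(signatures)
    return all_killed and all_unique
```

Under these assumptions, the single-test suite `{"t1"}` mirrors the pathological case: it kills every mutant, so `is_kill_adequate` accepts it, yet all three mutants share the same kill set, so `is_distinguishing_adequate` rejects it. Adding `t2` and `t3` gives each mutant its own kill set and satisfies both checks.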
However, since we considered only 100% adequate test suites for the mutation adequacy criteria, the relationship between the percentage of the mutation adequacy (i.e., mutation score) and the fault detection effectiveness was not fully investigated.

In this paper, we significantly extend our previous work in a manner that is both theoretical and empirical. Theoretically, to capture the diversity of mutants in terms of a set of tests, we establish a firm definition of the mutant distinguishment as the foundation of the distinguishing mutation