Aaron: An Adaptable Execution Environment

Marc Brünink, André Schmitt, Thomas Knauth, Martin Süßkraut, Ute Schiffel, Stephan Creutz, Christof Fetzer
Technische Universität Dresden
Department of Computer Science; 01062 Dresden; Germany
{marc, andre, thomas, suesskraut, ute, stephan, christof}@se.inf.tu-dresden.de

Abstract—Software bugs and hardware errors are the largest contributors to downtime [1], and can be permanent (e.g., deterministic memory violations, broken memory modules) or transient (e.g., race conditions, bit flips). Although a large variety of dependability mechanisms exist, only few are used in practice. The existing techniques do not prevail for several reasons: (1) the introduced performance overhead is often not negligible, (2) the gained coverage is not sufficient, and (3) users cannot control and adapt the mechanism. Aaron tackles these challenges by detecting hardware and software errors using automatically diversified software components. It uses these software variants only if CPU spare cycles are present in the system. In this way, Aaron increases fault coverage without incurring a perceivable performance penalty. Our evaluation shows that Aaron provides the same throughput as an execution of the original application while checking a large percentage of requests — whenever load permits.

Keywords—Fault detection; Fault tolerance; Diversity methods; Adaptive algorithm; Compiler transformation

I. INTRODUCTION

More and more, our daily life depends upon computing systems. The proliferation of these systems is accompanied by a demand for security, safety, and availability. To satisfy these demands, a large variety of dependability mechanisms have been developed, using either hardware or software solutions. Hardware solutions to dependability issues are costly to develop and deploy; they are a good choice for techniques that are mature.
One example is the NX bit used by W^X page protection (by default, every memory page is either writable or executable, never both). For dependability mechanisms that are still subject to change, most hardware solutions lack adaptivity. Techniques that are useful for only a minority of users are unlikely to be integrated into COTS hardware. Building specialized hardware incorporating these techniques can result in a prohibitively high cost-performance ratio compared to COTS components. Thus, dependability mechanisms should be implemented in software until they have matured and have been proven useful in a majority of application scenarios.

© 2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

In contrast to COTS hardware, software solutions are highly adaptable. Furthermore, many different software solutions that cope with dependability issues are available. These include, but are not limited to: out-of-bounds checkers [2], redundant execution [3, 4], software encoded processing [5], and recovery blocks [6]. Each of these approaches targets a specific set of failures; however, none covers all failures observable in deployed systems. Although coverage can be increased by multiplexing dependability mechanisms, the overheads of the different mechanisms add up. Even worse, interactions between them can lead to superlinear overheads. In addition, multiplexing might have a negative effect on the coverage of the individual mechanisms. Some of the existing dependability mechanisms only solve problems in specific execution environments or programming languages.
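To make the notion of a checked software variant concrete: an out-of-bounds checker such as [2] instruments every memory access so that it is validated against the bounds of the accessed object before the real load or store executes. The following Python sketch models that effect; the class and method names are illustrative only and do not correspond to Aaron's actual interface.

```python
class CheckedBuffer:
    """Models a buffer whose accesses pass through compiler-inserted
    bounds checks (illustrative sketch, not Aaron's implementation)."""

    def __init__(self, size):
        self._data = [0] * size
        self._size = size

    def load(self, index):
        # The inserted check: validate the index before the actual access.
        if not 0 <= index < self._size:
            raise IndexError("out-of-bounds read at index %d" % index)
        return self._data[index]

    def store(self, index, value):
        if not 0 <= index < self._size:
            raise IndexError("out-of-bounds write at index %d" % index)
        self._data[index] = value
```

A checked variant routes all accesses through such guards, while the unchecked original omits them; it is this difference in cost and coverage that lets an adaptive system trade throughput against fault detection by switching between the two variants.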
Naturally, the question arises why one should apply a dependability mechanism instead of changing the underlying setup. For example, instead of deploying an out-of-bounds checker, a software engineer could simply use a programming language that is safe with respect to out-of-bounds accesses. Similar arguments apply to dangling references (garbage-collected languages) and bit flips (redundant hardware). However, changing the programming language or execution environment in these ways only shifts dependability mechanisms to lower levels. Although shifting might enable stronger optimization or a more efficient implementation, it also decreases adaptability. Adaptability is favorable because it can lead to higher efficiency in the long run: as software matures in deployment, fewer errors remain in it, but the cost of checking stays constant. As a result, the cost-benefit ratio worsens over time. With a traditional deployment it is not possible to scale down checking. For example, once an application is developed in a safe language, it is hard to relax the incurred constraints at runtime in order to increase performance.

Aaron tackles these challenges by scheduling different runtime checks dynamically, depending on the load of the system and the maturity of the software. Maturity is not a monotonically increasing property but fluctuates, especially at major releases. To this end, Aaron has to adapt and, potentially, take hints from a system administrator. The different software diversity mechanisms we use to increase safety and security are discussed in Section II-B. Aaron uses CPU spare cycles to schedule software variants