KUDA: The Split Race Checker

U. Can Bekar                                   ucbekar@ku.edu.tr
Computer Science and Engineering, Koc University, Istanbul, Turkey

Tayfun Elmas                                   elmas@eecs.berkeley.edu
Electrical Engineering and Computer Science, University of California, Berkeley, USA

Semih Okur                                     okur2@illinois.edu
Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, USA

Serdar Tasiran                                 stasiran@ku.edu.tr
Computer Science and Engineering, Koc University, Istanbul, Turkey

Abstract

Numerous runtime analysis tools, both commercial and open source, are designed to observe concurrency bugs (i.e., heisenbugs) in parallel programs. However, the usability of these tools suffers from their performance: slowdowns may reach several thousand times the uninstrumented run of the application. The overhead of these tools stems from two tightly coupled tasks: tracing and analysis. Traditionally, on-the-fly runtime verification algorithms have been designed to run on the same processing unit as the code being monitored, so both costs contribute to the slowdown of the monitored program. We propose a novel runtime analysis framework that runs on commodity computers with the help of a GPU. Our approach carries out the analysis work in parallel on a separate, dedicated processing unit, without any additional or custom hardware. Simply put, we allow the parallel runtime analysis to run concurrently in another runtime. As a proof of concept, we investigate the detection of concurrency bugs, in particular data race detection, although the framework could also support other kinds of analysis, such as detecting specification violations, containing errors, and even recovering from them. We use only an additional CPU thread, a fixed-size 16 MB communication buffer, and a CUDA-enabled GPU. Our experiments on the performance of the split race checker framework show that it is on average 5 times faster than traditional race checking.

1. Introduction

We propose an approach that uses some of the computation cores and other hardware resources in a computer to monitor the programs running on the other cores for concurrency errors, and to contain and/or recover from these errors, if not immediately, then shortly after they take place. In exploring such an approach, we had two goals: (i) to have minimal, tolerable impact on the threads being monitored, and (ii) to have the monitoring algorithms work at the same speed as the program, while possibly lagging behind by a bounded amount. The rationale behind the first goal is to enable efficient, even post-deployment use of the monitoring and bug-detection algorithms for safety-critical systems. The rationale behind the second goal is to make it possible to contain concurrency errors, notify the threads that have experienced them, and gracefully shut down the program or recover from the error. One consequence of the second goal is that the monitoring framework must parallelize the event logging and analysis algorithms as much as possible.

In our approach, the application being monitored and the actual monitoring code run on separate processing units/resources. They communicate with each other through shared memory or message passing. The instrumented application code has only the additional responsibility of communicating relevant events to the monitoring code. The monitoring and runtime analysis code can be quite complex, but it runs on separate processors and is parallelized; thus, application performance is not affected by the runtime analyses being performed. We conjecture that the performance penalty on the monitored application due to instrumentation and communication of relevant events can be reduced to negligible levels, for example, using inexpensive hardware support such as hardware-assisted message passing (Francesco et al., 2005).
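The event channel between the instrumented application (producer) and the analysis side (consumer) can be sketched as a bounded ring buffer of access events. The following C sketch is illustrative only: the event layout, function names, and single-producer/single-consumer discipline are our assumptions, and a real implementation would need atomic or volatile indices, batching, and transfer to the GPU. The paper fixes only the total buffer size (16 MB).

```c
#include <stdint.h>
#include <stdbool.h>

/* One memory-access event communicated to the monitoring code.
 * Fields are hypothetical; the paper does not specify the record layout. */
typedef struct {
    uint64_t thread;   /* id of the accessing thread        */
    uint64_t addr;     /* address of the shared variable    */
    uint8_t  is_write; /* 1 = write access, 0 = read access */
} event;

/* Capacity chosen so the whole buffer occupies roughly 16 MB. */
#define EB_CAP ((uint32_t)(16u * 1024u * 1024u / sizeof(event)))

typedef struct {
    event    slots[EB_CAP];
    uint32_t head;  /* consumer reads at head     */
    uint32_t tail;  /* producer writes at tail    */
} event_buffer;

void eb_init(event_buffer *b) { b->head = b->tail = 0; }

/* Producer side: called from the instrumented application thread. */
bool eb_push(event_buffer *b, event e) {
    uint32_t next = (b->tail + 1) % EB_CAP;
    if (next == b->head) return false;  /* buffer full: block or drop */
    b->slots[b->tail] = e;
    b->tail = next;
    return true;
}

/* Consumer side: called from the dedicated analysis thread,
 * which drains events and ships them to the GPU for analysis. */
bool eb_pop(event_buffer *b, event *out) {
    if (b->head == b->tail) return false;  /* empty: analysis caught up */
    *out = b->slots[b->head];
    b->head = (b->head + 1) % EB_CAP;
    return true;
}
```

The bounded capacity is what produces the "bounded lag" behavior: once the buffer fills, the producer must either drop events or briefly wait, which caps how far the analysis can fall behind the application.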
The goal is for the monitoring code to run at least at the same speed as the application being monitored, lagging behind only by a small delay due to event communication. (In our experiments this delay was on the order of milliseconds.) This makes possible scenarios in which, in response to detected errors, the application is shut down gracefully, a previous valid checkpoint is restored, or the application is restarted.
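To give a flavor of the per-access analysis the consumer side might run, the following is a minimal Eraser-style lockset check in C. This is our own illustrative sketch, not the algorithm the paper necessarily implements on the GPU: each shared variable keeps the intersection of the lock sets held at its accesses, and an empty intersection flags a potential race. Lock sets are encoded as bitmasks, a representation that also suits data-parallel GPU processing.

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-variable analysis state (hypothetical names). */
typedef struct {
    uint32_t candidate; /* intersection of locksets seen so far  */
    bool     accessed;  /* has the variable been accessed yet?   */
} var_state;

void vs_init(var_state *v) {
    v->candidate = 0;
    v->accessed  = false;
}

/* Record an access made while holding the locks in `held`
 * (bitmask: lock i corresponds to bit i). Returns true when the
 * candidate lockset becomes empty, i.e. a potential data race. */
bool vs_access(var_state *v, uint32_t held) {
    if (!v->accessed) {
        v->candidate = held;   /* first access: start with its lockset */
        v->accessed  = true;
    } else {
        v->candidate &= held;  /* refine by intersection */
    }
    return v->candidate == 0;
}
```

Because each variable's state is small and updated independently, many such checks can be evaluated in parallel over a batch of logged events, which is the kind of workload a GPU handles well.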