GKLEE: Concolic Verification and Test Generation for GPUs

Guodong Li *
Fujitsu Laboratories of America, Sunnyvale, CA 94085, USA
gli@us.fujitsu.com

Peng Li, Geof Sawaya, Ganesh Gopalakrishnan
School of Computing, University of Utah, Salt Lake City, UT 84112, USA
{peterlee,sawaya,ganesh}@cs.utah.edu

Indradeep Ghosh, Sreeranga P. Rajan
Fujitsu Laboratories of America, Sunnyvale, CA 94085, USA
{ighosh,sree.rajan}@us.fujitsu.com

Abstract

Programs written for GPUs often contain correctness errors such as races and deadlocks, or may compute the wrong result. Existing debugging tools often miss these errors because of their limited input-space and execution-space exploration. Existing tools based on conservative static analysis or conservative modeling of SIMD concurrency generate false alarms, resulting in wasted bug-hunting effort. They also often do not target performance bugs (non-coalesced memory accesses, memory bank conflicts, and divergent warps). We provide a new framework called GKLEE that can analyze C++ GPU programs, locating the aforesaid correctness and performance bugs. For these programs, GKLEE can also automatically generate tests that provide high coverage. These tests serve as concrete witnesses for every reported bug. They can also be used for downstream debugging, for example to test the kernel on the actual hardware. We describe the architecture of GKLEE and its symbolic virtual machine model, and report previously unknown bugs and performance issues that it detected on commercial SDK kernels. We also describe GKLEE's test-case reduction heuristics and the resulting scalability improvement for a given coverage target.

Categories and Subject Descriptors: D.2.4 [Software Engineering]: Software/Program Verification - Validation

General Terms: Reliability, Verification

Keywords: GPU, CUDA, Parallelism, Symbolic Execution, Formal Verification, Automatic Test Generation, Virtual Machine

1.
Introduction

Multicore CPUs and GPUs are making inroads into virtually all aspects of computing, from portable information appliances to supercomputers. Unfortunately, programming multicore systems to achieve high performance often requires many intricate optimizations involving memory bandwidth and CPU/GPU occupancy. A majority of these optimizations are still carried out manually. Given the sheer complexity of these optimizations in the context of actual problems, designers routinely introduce correctness and performance bugs. Locating these bugs using today's commercial debuggers is always a 'hit-or-miss' affair: one has to be lucky in so many ways, including (i) picking the right test inputs, (ii) being able to observe data corruption (and to reliably attribute it to races), (iii) whether the compiler optimizations match programmer assumptions, and (iv) whether the platform masks bugs because of the specific thread/warp scheduling algorithms used. If the execution deadlocks, one has to manually reason out the root cause.

* Guodong Li started this project while a student at the University of Utah.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP'12, February 25–29, 2012, New Orleans, Louisiana, USA.
Copyright © 2012 ACM 978-1-4503-1160-1/12/02...$10.00

Recent formal and semi-formal analysis based tools [1–3] have improved the situation in many ways. They, in effect, examine whole classes of inputs and executions by resorting to symbolic analysis or static analysis methods. They also analyze abstract GPU models without making hardware-specific thread scheduling assumptions.
These tools also have many drawbacks. The first problem with predominantly static-analysis-based approaches is false alarms. False alarms waste precious designer time and may dissuade designers from using a tool. Another limitation of today's tools is that they do not help generate tests that achieve high code coverage. Such tests are important for unearthing compiler bugs or “unexpected” bugs that surface during hardware execution. Existing tools also do not cover one new data race category that we identify (we call it the warp-divergence race). Compilation-based approaches can, in many cases, eliminate the drudgery of GPU program optimization; however, their code transformation scripts are seldom separately formally verified.

We present a new tool framework called GKLEE for analyzing GPU programs with respect to important correctness and performance issues (the tool name comes from “GPU” and “KLEE” [4]). GKLEE profits from KLEE's code base and its philosophy of testing a given program using concrete plus symbolic (“concolic”) execution. GKLEE is the first concolic verifier and test generator tailored for GPU programs. Concolic verifiers allow designers to declare certain input variables as 'symbolic' (the remaining inputs are concrete). In GKLEE, the execution of a program expression containing symbolic variables results in constraints among the program variables, including constraints due to conditionals and explicit constraints (assume statements) on symbolic inputs. Conditionals are resolved by KLEE's decision procedures (“SMT solvers” [5]) that find solutions for symbolic program inputs. This approach helps concolic verifiers do something beyond bug-hunting: they can automatically enumerate test inputs in a demand-driven manner. That is, if there is a control/branch decision that can be affected by some input, a concolic verifier can automatically compute the input value and record it in a test, which is valuable for downstream debugging.
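The demand-driven test enumeration described above can be illustrated with a minimal C++ sketch. It is not GKLEE's or KLEE's implementation: the names `Constraint`, `Witness`, `solve`, and `tests_for_branch` are hypothetical, and an exhaustive search over a small integer range stands in for the SMT solver. The sketch only shows the idea that an assume constraint plus a branch condition (or its negation) forms a path constraint, and that any satisfying input can be recorded as a test.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A constraint over a single symbolic integer input (illustrative only).
using Constraint = std::function<bool(int)>;

// Result of a solver query: whether the constraints are satisfiable in the
// searched range and, if so, one concrete input that satisfies them.
struct Witness {
    bool feasible;
    int value;
};

// Stand-in for an SMT solver (assumption: a real tool would not enumerate).
// Returns the smallest x in [lo, hi] that satisfies every constraint.
// Intended for small ranges; hi must be below INT_MAX to avoid overflow.
Witness solve(const std::vector<Constraint>& constraints, int lo, int hi) {
    assert(lo <= hi);
    for (int x = lo; x <= hi; ++x) {
        bool all_hold = true;
        for (const Constraint& c : constraints) {
            if (!c(x)) { all_hold = false; break; }
        }
        if (all_hold) return Witness{true, x};
    }
    return Witness{false, 0};
}

// One test input per feasible direction of a branch.
struct BranchTests {
    Witness taken;      // input that makes the branch condition true
    Witness not_taken;  // input that makes the branch condition false
};

// Builds the two path constraints (assume && cond, assume && !cond) and
// asks the stand-in solver for a concrete witness of each.
BranchTests tests_for_branch(const Constraint& assume, const Constraint& cond,
                             int lo, int hi) {
    Constraint negated = [cond](int x) { return !cond(x); };
    BranchTests result;
    result.taken = solve({assume, cond}, lo, hi);
    result.not_taken = solve({assume, negated}, lo, hi);
    return result;
}
```

For example, with the assumption `0 <= x < 256`, the branch condition `x > 100`, and a search range of [-1000, 1000], `tests_for_branch` yields the input 101 for the taken direction and 0 for the not-taken direction. Tightening the assumption to `0 <= x < 50` makes the taken direction infeasible, so no test is produced for it.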
Recent experience shows that formal methods often have the biggest impact when they can compute tests automatically, exposing software defects and vulnerabilities [6–8].

The architecture of GKLEE is shown in Figure 1. It employs a C/C++ front-end based on LLVM-GCC (with our customized