Post-Silicon Debug Using Formal Verification Waypoints

C. Richard Ho, Michael Theobald, Brannon Batson, J.P. Grossman, Stanley C. Wang, Joseph Gagliardo, Martin M. Deneroff, Ron O. Dror, David E. Shaw*

D. E. Shaw Research, New York, NY 10036, USA

{Richard.Ho, Michael.Theobald, Brannon.Batson, JP.Grossman, Stan.Wang, Joe.Gagliardo, Marty.Deneroff, Ron.Dror, David.Shaw}@DEShawResearch.com

Abstract—Applying formal methods to assist in the post-silicon debugging of complex digital designs presents challenges that are distinct from those found in pre-silicon formal verification. In post-silicon debug, a set of observed events or conditions describes a failure scenario. The task is to identify a reasonably general set of input and hardware state conditions that inevitably produces that failure scenario. That set of conditions may be represented in the form of a counterexample to a desired property. Modern formal verification methods are especially adept at finding counterexamples to properties, and can often do so efficiently in large state spaces. This paper describes a method of assisting the discovery of counterexamples using user-hypothesized preconditions, or waypoints, of the failure. Each waypoint is an event that is believed to occur prior to the observed failure of the target property. By guiding formal analysis through a sequence of waypoints, the time required to find a counterexample of the target property can be significantly reduced. A specific case study is presented to illustrate the application and performance of our method using an actual example from the post-silicon debug of a 33-million–gate chip.

I. INTRODUCTION

The post-silicon debug of functional errors in large, highly complex Application-Specific Integrated Circuits (ASICs) frequently requires extensive detective work to isolate symptoms and identify underlying causes.
Lack of observability, long runtimes to reach the error state, and imprecise control of event timing make many post-silicon bug hunts tedious and time-consuming endeavors. In this paper, we describe one such bug hunt involving the Anton ASIC [1], a 33-million–gate chip designed to accelerate molecular dynamics (MD) calculations. In this case, the ASIC exhibited erroneous behavior resulting in occasional memory corruption. The symptoms of the error (the error signature) were analyzed and a hypothesis of how the error occurred was formulated. This hypothesis involved certain complex corner-case conditions and particular event sequences. Extensive random simulation targeting the bug, however, did not succeed in validating this hypothesis, primarily because the bug appeared only in a specific, hard-to-reach hardware state whose occurrence depended on the precise timing of input stimuli.

The bug was eventually isolated and reproduced through a process of formal verification based on model checking [2]. In particular, we used an approach based on targeting sets of conditions called waypoints, which are hypothesized by the user to necessarily occur en route to the bug in question. The bug was found to lie beyond the practical reach of standard (bounded) model checking from a reset state, which could only complete exhaustive analysis to a depth of 65 cycles within a three-day time limit and a 32-GB memory limit. Using the method described here, however, the hypothesized cause of the bug was analyzed to generate waypoints, which were then targeted by model checking. Once an input sequence was found that led to a given waypoint, a state trace was generated, then used as the initialization sequence for model checking to the next waypoint or to the eventual error condition. In this way, formal verification was guided to find the bug at a depth of 69 cycles from reset within ten hours of computation.
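
The waypoint-chaining idea described above can be illustrated with a minimal sketch. It is not the tool flow used on the Anton ASIC: an explicit-state breadth-first search stands in for the bounded model checker, and the toy design (two 3-bit counters), the input names, and the functions `bounded_search`, `replay`, and `guided_search` are all hypothetical names introduced here for illustration. The error state lies 10 steps from reset, beyond a depth bound of 6, but becomes reachable when the search is restarted from the state at which a hypothesized waypoint is hit.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[int, int]            # toy design state: two 3-bit counters (a, b)
Predicate = Callable[[State], bool]

RESET: State = (0, 0)
INPUTS: Sequence[str] = ("inc_a", "inc_b", "nop")


def step(state: State, stimulus: str) -> State:
    """Toy next-state function (an assumption, not the real design)."""
    a, b = state
    if stimulus == "inc_a":
        return ((a + 1) % 8, b)
    if stimulus == "inc_b":
        return (a, (b + 1) % 8)
    return state


def bounded_search(start: State, target: Predicate,
                   max_depth: int) -> Optional[List[str]]:
    """Exhaustive breadth-first search up to max_depth steps from start.

    Returns a shortest input sequence reaching a state that satisfies
    target, or None if no such state exists within the bound.
    """
    if target(start):
        return []
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, trace = frontier.popleft()
        if len(trace) == max_depth:
            continue
        for stimulus in INPUTS:
            nxt = step(state, stimulus)
            if nxt in seen:
                continue
            seen.add(nxt)
            new_trace = trace + [stimulus]
            if target(nxt):
                return new_trace
            frontier.append((nxt, new_trace))
    return None


def replay(start: State, trace: Sequence[str]) -> State:
    """Apply an input trace to obtain the state it leads to."""
    state = start
    for stimulus in trace:
        state = step(state, stimulus)
    return state


def guided_search(reset: State, waypoints: Sequence[Predicate],
                  error: Predicate, max_depth: int) -> Optional[List[str]]:
    """Chain bounded searches: each waypoint trace initializes the next leg."""
    state, full_trace = reset, []
    for target in list(waypoints) + [error]:
        segment = bounded_search(state, target, max_depth)
        if segment is None:
            return None            # waypoint hypothesis (or bound) was wrong
        full_trace += segment
        state = replay(state, segment)
    return full_trace


def waypoint(s: State) -> bool:    # hypothesized precondition of the failure
    return s[0] == 5


def error(s: State) -> bool:       # the failure condition itself
    return s == (5, 5)


if __name__ == "__main__":
    print("direct :", bounded_search(RESET, error, max_depth=6))
    print("guided :", guided_search(RESET, [waypoint], error, max_depth=6))
```

Note that a guided search of this kind is not exhaustive: it commits to the first state at which each waypoint is reached, so a failure to find the error says nothing about whether the error is reachable by another route.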
Although the bug was only four cycles beyond the depth reached by exhaustive analysis from reset, those additional cycles have high computational complexity, which would have made analysis using standard model checking impractical to complete within a reasonable amount of time. By using waypoints to reduce the amount of analysis needed to find the error trace, however, we were able to validate the hypothesized cause of the bug without a prohibitive expenditure of computational resources. This approach also allowed the analysis of conditions around the bug, and ultimately confirmed that the error would no longer occur after the design was corrected.

In the remainder of this paper, we discuss each of the major steps in our method, including (1) converting error symptoms into assertions, (2) finding the right level of logic to analyze so that the bug can be exhibited, (3) choosing the appropriate places in the design to abstract logic, (4) setting the necessary input constraints, and (5) finding the trace to the bug. We also present runtime data comparing standard model checking of the error assertion to guided model checking using waypoints.

* Correspondence to David.Shaw@DEShawResearch.com. David E. Shaw is also with the Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032.