Debugging post-silicon fails in the IBM POWER8 bring-up lab

M. Dusanapudi, S. Fields, M. S. Floyd, G. L. Guthrie, R. Kalla, S. Kapoor, L. S. Leitner, C. F. Marino, J. J. McGill, A. Nahir, K. Reick, H. Shen, K. L. Wright

Debugging post-silicon fails continues to be a difficult problem that is becoming even more challenging as chips integrate more functionality and implement increasingly complicated functions. Additionally, the complexity of hardware systems, coupled with the difficulty in observing the state of the system that led to the failure, makes the debugging effort a unique challenge. In this paper, we review the techniques and mechanisms used to facilitate effective debugging in the POWER8 processor post-silicon validation phase. We further describe several functional bugs and the debugging process that led to the identification of their root causes.

Introduction

Getting a product to market as fast as possible is important, especially in the technology industry. Due to the trends described by Moore’s Law, the sooner a computer ships, the more competitive it will be. Any new computer design must be validated and shown to run free of errors before it can be released. The size and complexity of a chip such as the IBM POWER8* processor, and of the systems in which it ships, make finding all design bugs extremely difficult during the pre-silicon verification phase.

We employ an advanced methodology and a suite of techniques in a variety of simulation environments for verifying the design of our POWER* processors [1, 2]. Despite these industry-leading verification capabilities, inevitably there are complex scenarios leading to fails that are not found until the real chip is running in the lab. This is because failing scenarios (so-called “corner cases” or “window conditions”) in the design cannot always be anticipated with directed tests or discovered through formal or random verification techniques in the simulation environment.
This is due in part to the finite amount of simulation resources and time available between the point at which the simulation model of the chip design becomes available and its release for fabrication in silicon. The fact that some bugs are not found until real silicon is available calls for pre-planning of more than one tape-out (i.e., the fabrication of more than one version of the chip). During the time frame between tape-outs, a significant effort is placed on finding remaining bugs in the design to deliver a high-quality product. This phase is commonly known as the post-silicon validation phase.

Post-silicon validation benefits greatly from the speed and scale of real fabricated processors. Within a few days of receiving silicon, more runtime is accumulated than in all the cycles run during the pre-silicon verification phase. The team can construct real system configurations, larger and more complex than can be included in a simulation model, and run many types of workloads on them, including test exercisers [3] that are developed explicitly to generate scenarios that will expose hard-to-find bugs in the design. However, running on the real processor has significant disadvantages as well. The lack of observability into the state of the design in the hardware lab makes the analysis of failed tests a unique challenge.

In this paper, we provide an overview of the debugging mechanisms embedded in the POWER8 processor. We describe the structure of each mechanism together with the capabilities it provides to the validation team. Furthermore, we provide a walkthrough of the debugging process for several real system-level bugs found in the POWER8 post-silicon validation phase. For each such bug, we explain how the validation team leveraged the debugging mechanisms to facilitate an effective debug process. Of course, all the bugs described in this paper have been fixed and no longer exist in POWER8-based systems.

© Copyright 2015 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

IBM J. RES. & DEV. VOL. 59 NO. 1 PAPER 12 JANUARY/FEBRUARY 2015. M. DUSANAPUDI ET AL. 12:1. 0018-8646/15 © 2015 IBM. Digital Object Identifier: 10.1147/JRD.2014.2380272

The rest of the paper is organized as follows. In the next section, we provide a brief overview on debugging