Sick But Not Dead Testing - A New Approach to System Test

Tara Astigarraga 1, Michael Browne 2, Lou Dickens 3
Systems and Technology Group, IBM
1 Rochester, NY 14626; 2 Poughkeepsie, NY 12601; 3 Tucson, AZ 85744
{asti, browne, dickens}@us.ibm.com

Abstract— Enterprise data center implementations make significant investments in high availability configurations: redundant hardware, software, and Input/Output (I/O) paths that are quite successful in many failure scenarios. However, in spite of all that investment, clients still face unexpected outages and performance impacts related to a phenomenon referred to as Sick but not Dead (SBND) errors. SBND errors are sometimes lumped together with other related errors, including transient errors, partial failure scenarios, and soft errors. While SBND errors share many characteristics with those errors, there are key differences and environment impacts, which we explore further in this paper. We also present new proactive techniques, inject scenarios, and methods to identify, characterize, and address SBND failures, including cross-component impacts and failures.

Keywords-Software Testing; Sick but not Dead; Software Engineering; Partial Failure; Transient Error; Soft Failure; SAN Test; System Test.

I. INTRODUCTION AND MOTIVATION

Despite high availability (HA) configurations, customers are still experiencing outages and severe performance declines in their environments. These outages typically show no signs of the hard component failures to which the HA infrastructure would react and provide recovery. We classify these errors as Sick but not Dead (SBND) failures. They are often the hardest failures to identify and can have sporadic but lasting impacts on the environment as a whole. SBND failures currently represent 80% of business impact, but only about 20% of the problems [2].

SBND errors are sometimes lumped together with other related errors, including transient errors, partial failure scenarios, and soft errors. While SBND errors share many characteristics with those errors, there are key differences as well. SBND errors by definition derive from a component within the I/O path that is ‘sick’, meaning it behaves in an unorthodox or partially failed fashion, but is not completely ‘dead’ or hard failed. Depending on the component exhibiting the SBND characteristics, the symptoms can vary and come and go at different intervals, and it can take anywhere from seconds to months for the component to finally reach a hard-fail state. It is during this in-between time that the component is defined as SBND.

Complex customer solutions and environments utilizing mixed-vendor products and technologies create textbook scenarios for SBND failures. Many products are intolerant of misbehavior by other devices, and most failure paths deal promptly with hard failure scenarios but are slower and more cautious in reacting to partially failed, misbehaving, or SBND components in a Storage Area Network (SAN). With current field solutions, problem determination for SBND failure scenarios is complex and time consuming, and often requires special problem determination lab trace tools and a team of cross-vendor product and solution experts. Current resolutions to SBND failure scenarios are almost always reactive rather than proactive.
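To make the in-between state concrete, consider the minimal sketch below. It is a hypothetical Python illustration, not the inject tooling described in this paper; the SBNDInjector name and its delay_prob/drop_prob parameters are our own. It wraps an I/O call so that most requests succeed while a small fraction stall or time out, which is precisely the intermittent misbehavior that hard-failure recovery paths tend to miss.

    import random
    import time

    class SBNDInjector:
        """Make a simulated I/O component 'sick but not dead': most requests
        succeed, but a small fraction stall or never complete, mimicking a
        partially failed SAN component rather than a hard-failed one.
        (Hypothetical sketch; not the inject tooling described in this paper.)"""

        def __init__(self, delay_prob=0.05, drop_prob=0.01,
                     min_delay_s=0.5, max_delay_s=5.0, seed=None):
            self.delay_prob = delay_prob    # fraction of I/Os that stall
            self.drop_prob = drop_prob      # fraction of I/Os that time out
            self.min_delay_s = min_delay_s
            self.max_delay_s = max_delay_s
            self.rng = random.Random(seed)

        def submit_io(self, io_fn, *args, **kwargs):
            """Wrap a real I/O call: usually pass through untouched, sometimes
            delay, rarely drop, so the component misbehaves without dying."""
            roll = self.rng.random()
            if roll < self.drop_prob:
                raise TimeoutError("injected SBND drop: no response")
            if roll < self.drop_prob + self.delay_prob:
                time.sleep(self.rng.uniform(self.min_delay_s, self.max_delay_s))
            return io_fn(*args, **kwargs)

    # Example: wrap a trivial read so roughly 6% of calls misbehave.
    inj = SBNDInjector(seed=42)
    data = inj.submit_io(lambda: "block 0x1A")

Setting drop_prob to 1.0 would reproduce the hard-failure case that existing recovery paths already handle well; it is the small, intermittent values that exercise the SBND state.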
In our system test and SAN labs, we have been developing new proactive techniques, protocol inject scenarios, and methods to identify, characterize, and address SBND failures, including cross-component impacts and failures across the I/O path. Our current research into reported SBND defects shows that the largest number of SBND problems occurs along the I/O path. While related problems do occasionally occur within internal server paths, they are significantly less frequent, easier to debug, and typically contained to a single server and handled via embedded HA mechanisms.

Systems generally behave properly when failures are solid, hard failures. It is when components act SBND that system availability is often at risk. In these scenarios, failover or recovery mechanisms often do not behave as we would expect them to. Often the problems are corner cases that are not easily reproducible and are hard to troubleshoot, yet they continue to plague customer environments. It should also be noted that SBND problems do not occur in a particular vendor or product set; rather, they are system-level events that arise when one or more components in the environment do not behave consistently. Since the problem does not relate to a particular vendor or component issue, there is no simple fix; it is a system-level event that must be fully understood, tested, and addressed by all vendors in a distributed-systems SAN environment.

The focus of this paper is on SBND failures related to the I/O path in distributed-systems Fibre Channel (FC) SAN and Fibre Channel over Ethernet (FCoE) environments.
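As one purely illustrative picture of what proactive identification could look like, the following sketch flags a component whose I/O latency repeatedly drifts far from its recent baseline even though no hard failure has been reported. The SBNDMonitor name, its parameters, and the streak heuristic are assumptions of ours, not an algorithm from this paper.

    from collections import deque
    from statistics import mean, stdev

    class SBNDMonitor:
        """Track recent healthy I/O latencies and alert when a run of
        consecutive outliers suggests a component is sick but not dead.
        (Hypothetical sketch, not a method published in this paper.)"""

        def __init__(self, window=200, threshold_sigma=4.0,
                     min_samples=50, streak=5):
            self.samples = deque(maxlen=window)   # recent healthy latencies
            self.threshold_sigma = threshold_sigma
            self.min_samples = min_samples
            self.streak = streak                  # consecutive outliers => alert
            self.outliers = 0

        def record(self, latency_ms):
            """Feed one completed I/O latency (ms); return True once enough
            consecutive outliers accumulate to suspect SBND behavior."""
            is_outlier = False
            if len(self.samples) >= self.min_samples:
                mu = mean(self.samples)
                sigma = stdev(self.samples) or 1e-6
                is_outlier = latency_ms > mu + self.threshold_sigma * sigma
            if is_outlier:
                self.outliers += 1
            else:
                self.outliers = 0
                # Only healthy samples feed the baseline, so intermittent
                # stalls cannot quietly widen their own detection threshold.
                self.samples.append(latency_ms)
            return self.outliers >= self.streak

    mon = SBNDMonitor()
    healthy = [2.0 + 0.1 * (i % 5) for i in range(60)]   # ~2 ms baseline
    sick = [40.0] * 5                                    # intermittent stalls
    alerts = [mon.record(x) for x in healthy + sick]
    # alerts[-1] is True: five straight outliers with no hard failure reported

A lone latency spike resets the streak, so only sustained misbehavior without an accompanying hard failure raises an alert; that distinction between a transient blip and a lingering SBND state is exactly what this paper argues current field solutions lack.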