Enable PCI Express Advanced Error Reporting in the Kernel

Yanmin Zhang and T. Long Nguyen
Intel Corporation
yanmin.zhang@intel.com, tom.l.nguyen@intel.com

Abstract

PCI Express is a high-performance, general-purpose I/O interconnect. It introduces AER (Advanced Error Reporting) concepts, which provide significantly higher reliability at a lower cost than the previous PCI and PCI-X standards. The AER driver of the Linux kernel provides a clean, generic, and architecture-independent solution. As long as a platform supports PCI Express, the AER driver gathers and manages all PCI Express errors that occur and cooperates with PCI Express device drivers to perform error-recovery actions.

This paper is targeted toward kernel developers interested in the details of enabling PCI Express device drivers, and it provides insight into the scope of implementing the PCI Express AER driver and the AER conformation usage model.

1 Introduction

Current machines need higher reliability than before and need to recover from failures quickly. Peripheral devices are one source of failure: a device might run into errors or malfunction completely. If a device malfunctions, its device driver might receive bad information and cause a kernel panic, so the system might crash unexpectedly.

As a matter of fact, IBM engineers (Linas Vepstas and others) created a framework to support PCI error recovery procedures in-kernel, because IBM Power4 and Power5-based pSeries platforms provide specific PCI device error recovery functions [4]. However, this model is not platform independent, and it is not easy for individual developers to obtain a Power machine for testing these functions.

PCI Express introduces AER, which is a worldwide standard. The PCI Express AER driver was developed to support PCI Express AER. First, any platform which supports PCI Express can use the PCI Express AER driver to process device errors and handle error recovery accordingly.
Second, because many platforms support PCI Express, it is far easier for individual developers to obtain such a machine and add error recovery code to specific device drivers.

2 PCI Express Advanced Error Reporting Driver

2.1 PCI Express Advanced Error Reporting Topology

To understand the PCI Express Advanced Error Reporting Driver architecture, it helps to begin with the basics of PCI Express Port topology. Figure 1 illustrates two types of PCI Express Port devices: the Root Port and the Switch Port. The Root Port originates a PCI Express Link from a PCI Express Root Complex. The Switch Port whose secondary bus represents the switch's internal routing logic is called the Switch Upstream Port. The Switch Port which bridges from the switch's internal routing buses to the bus representing the downstream PCI Express Link is called the Switch Downstream Port.

Each PCI Express Port device can be implemented to support up to four distinct services: native hot plug (HP), power management event (PME), advanced error reporting (AER), and virtual channels (VC). The AER driver development is based on the service driver framework of the PCI Express Port Bus Driver design model [3]. As illustrated in Figure 2, the PCI Express AER driver serves as a Root Port AER service driver attached to the PCI Express Port Bus driver.

2.2 PCI Express Advanced Error Reporting Driver Architecture

PCI Express error signaling can occur on the PCI Express link itself or on behalf of transactions initiated on the link. PCI Express defines the AER capability, which is implemented with the PCI Express AER Extended