On the Reliability of Hardware Event Monitors in MPSoCs for Critical Domains

Javier Barrera†*, Leonidas Kosmidis†, Hamid Tabani†, Enrico Mezzetti†, Jaume Abella†, Mikel Fernandez†, Guillem Bernat§ and Francisco J. Cazorla†
† Barcelona Supercomputing Center, Spain
* Universitat Politecnica de Catalunya, Spain
§ Rapita Systems Ltd., UK

Abstract—Performance Monitoring Units (PMUs) are at the heart of the most advanced timing analysis techniques to control and bound the impact of contention in Commercial Off-The-Shelf (COTS) SoCs with shared resources (e.g. GPUs and multicore CPUs). In this paper, we report discrepancies between the values obtained from PMU event monitors and the number of events expected based on the event descriptions in the processor's official documentation. Such discrepancies, which may stem either from actual hardware errors or from inaccurate specifications, make PMU readings unreliable. This is particularly problematic in consideration of the critical role played by event monitors for timing analysis in domains such as automotive and avionics. This paper proposes a systematic procedure for event monitor validation. We apply it to validate event monitors in the NVIDIA Xavier and TX2, and the Xilinx Zynq UltraScale+ MPSoC. We show that, while some event monitors count as expected, this is not the case for others, whose discrepancies with expected values we analyze.

I. INTRODUCTION

Performance-improving features, until recently only used in processors for the high-performance domain, are increasingly used in processors in domains like automotive [14]. Those features include multicores, multi-level caches, complex on-chip networks, and accelerators, among which GPUs have a dominant position [4], [35], [27].
This transition from simple micro-controllers to complex micro-processors is driven by the unprecedented performance requirements of complex critical software supporting functionalities like autonomous driving in automotive and more autonomous missions in space [6], [37].

COTS processors in critical domains have limited hardware support for time predictability. This includes automotive processors and SoCs such as the NVIDIA DrivePX (Parker and Xavier SoCs), RENESAS R-Car H3, QUALCOMM SnapDragon 820, and Intel Go. Similar concerns also arise on SoCs such as the Xilinx Zynq UltraScale+, increasingly considered for avionics and railway applications, among others [38]. Trying to achieve full isolation in software, resorting for example to page (memory) colouring techniques¹, has been shown to be insufficient, since interference still exists in shared queues and buffers [32]. Promising software solutions for multicores build on event quota budgeting, monitoring, and enforcement [29], [40], [34], [11] to establish and enforce budgets on each task's (core's) maximum shared resource utilization. The latter is measured with event monitors; e.g. last-level cache misses are used to capture a task's memory utilization. The system software monitors tasks' activities via the hardware event monitors offered by processors' PMUs and suspends or slows down a task's execution when its assigned budget is about to be exhausted.

¹ Colouring is a well-known technique to segregate accesses to the different blocks of memory-like resources [17], like banks of the shared last-level on-chip cache, the banks and ranks in a DDR memory system [26], [24], [33], or even combined cache-memory segregation [16].

Problem Statement. Existing software approaches and solutions for quota (event) monitoring and enforcement, as well as software debugging processes, build on the naive assumption that event monitors and their documentation are always correct.
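The quota monitoring and enforcement scheme described above can be sketched as follows. This is a minimal, illustrative simulation, not any vendor's API: all names (`EventMonitor`, `enforce_quota`) are hypothetical, and a real implementation would read a hardware PMU counter rather than the mock used here.

```python
class EventMonitor:
    """Mock of a hardware PMU event counter (e.g. last-level cache
    misses). A real system would read the counter via the PMU; here
    we simulate a task generating a fixed number of events per tick."""
    def __init__(self, events_per_tick):
        self.count = 0
        self.events_per_tick = events_per_tick

    def sample(self):
        # Pretend the task ran for one scheduling tick, then read the count.
        self.count += self.events_per_tick
        return self.count

def enforce_quota(monitor, budget, threshold=0.9, max_ticks=1000):
    """Periodically sample the event monitor and report the tick at
    which the task must be suspended (or slowed down), i.e. when its
    event count first reaches threshold * budget."""
    for tick in range(1, max_ticks + 1):
        if monitor.sample() >= threshold * budget:
            return tick  # budget nearly exhausted: act on the task here
    return None  # budget never approached within max_ticks

# A task causing 50 simulated cache misses per tick under a 1000-miss
# budget reaches the 90% threshold (900 misses) at tick 18.
tick = enforce_quota(EventMonitor(50), budget=1000)
```

The point the paper makes is precisely that the `monitor.sample()` step is only as trustworthy as the underlying event monitor: if the counter miscounts relative to its documentation, the enforcement decision is wrong.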
In fact, the trustworthiness of event monitors in COTS processors has not yet been questioned in the real-time research community, despite their critical role as a functional and non-functional verification means. The validity of all quota-based software solutions cannot be sustained without providing evidence that the event monitors function correctly, according to the specification available in the official documentation. The lack of such supporting evidence ultimately jeopardizes the timing arguments and potentially invalidates the evidence gathered to successfully undergo the mandatory timing V&V process in accordance with safety regulations.

Contribution. In this paper we take an initial step towards reconciling PMU verification (often disregarded) with its critical role for timing analysis. Our contributions are as follows:

(1) Analysis of Event Monitor Correctness. We analyse several event monitors present i) in the GPU of the NVIDIA AGX Xavier and TX2 development boards, and ii) in the CPU of the Xilinx UltraScale+ SoC, and we assess them against their technical specification. Our goal is not to cover all event monitors supported by those architectures, which number several hundred [15]. We aim, instead, at illustrating that some event monitors might not behave as expected. For specific code snippets, we show that discrepancies occur between observed event counts and the values that a performance analyst would expect based on the event monitor specifications provided in the corresponding product manuals. Such evidence supports our claim that OEM, Tier-1, and timing analysis companies cannot blindly trust event monitors without a preliminary validation.

(2) Monitor Validation Process. We describe the steps in a manual validation process that helps to validate the event monitors.

The final publication is available at ACM via http://dx.doi.org/10.1145/3341105.3373955
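The core check behind such a validation process can be sketched as follows. This is an illustrative skeleton under our own assumptions, not the authors' exact procedure: execute a snippet whose event count is known by construction, read the monitor over several runs, and flag the monitor when observations deviate from the expectation beyond a tolerance. The function names and the mock readers are hypothetical.

```python
def validate_monitor(read_count, expected, runs=10, tolerance=0.05):
    """Run the instrumented snippet `runs` times; `read_count()` must
    return the event count reported by the monitor for one run.
    The monitor passes only if every observation is within
    `tolerance` (relative) of the expected count derived from the
    snippet's construction and the event's documented semantics."""
    observations = [read_count() for _ in range(runs)]
    ok = all(abs(obs - expected) <= tolerance * expected
             for obs in observations)
    return ok, observations

# Mock readers standing in for real PMU reads: one monitor that counts
# as documented, and one that (hypothetically) counts each event twice.
ok_good, _ = validate_monitor(lambda: 1000, expected=1000)
ok_bad, _ = validate_monitor(lambda: 2000, expected=1000)
```

In practice the hard part, which the paper's process addresses, is constructing snippets whose expected count is actually derivable from the documentation (e.g. a loop issuing a known number of cache-missing loads), so that a mismatch can be attributed to the monitor or its specification rather than to the experiment.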