HRM: Merging Hardware Event Monitors for Improved Timing Analysis of Complex MPSoCs

Sergi Vilardell (Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, sergi.vilardell@bsc.es); Isabel Serra (Centre de Recerca Matemàtica, Barcelona Supercomputing Center, isabel.serra@bsc.es); Roberto Santalla, Enrico Mezzetti, Jaume Abella and Francisco J. Cazorla (Barcelona Supercomputing Center, name.surname@bsc.es)

Abstract—The Performance Monitoring Unit (PMU) in MPSoCs is at the heart of the latest measurement-based timing analysis techniques in Critical Embedded Systems. In particular, hardware event monitors (HEMs) in the PMU are used as building blocks in the process of budgeting and verifying software timing by tracking and controlling access counts to shared resources. While the number of HEMs in current MPSoCs reaches hundreds, they are read via Performance Monitoring Counters whose number is usually limited to 4-8, thus requiring multiple runs of each experiment to collect all desired HEMs. Despite the effort of engineers in controlling the execution conditions of each experiment, the complexity of current MPSoCs makes it arguably impossible to completely remove the noise affecting each run. As a result, HEMs read in different runs are subject to different variability, and hence, HEMs captured in different runs cannot be ‘blindly’ merged. In this work, we focus on the NXP T2080 platform, where we observed up to 59% variability across different runs of the same experiment for some relevant HEMs (e.g. processor cycles). We develop a HEM reading and merging (HRM) approach to reliably join HEMs across different runs as a fundamental element of any measurement-based timing budgeting and verification technique. Our method builds on order statistics and the selection of an anchor HEM, read in all runs, to derive the most plausible combination of HEM readings that keeps the distribution of each HEM and its relationship with the anchor HEM intact.
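The anchor-based merging idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes two disjoint sets of runs that each record a shared anchor HEM (e.g. processor cycles) alongside one other HEM, and pairs readings across the two sets by their rank on the anchor (order statistics), so that each HEM's marginal distribution and its ordering relative to the anchor are preserved. The function name and the rank-pairing rule are illustrative assumptions.

```python
# Illustrative sketch of anchor-based merging (an assumption, not the
# paper's implementation): readings from two disjoint run sets are paired
# by their rank on a shared anchor HEM rather than by run index.
def merge_by_anchor(runs_a, runs_b):
    """runs_a, runs_b: lists of (anchor_value, hem_value) tuples from two
    disjoint sets of runs, both of which read the anchor HEM.
    Returns (anchor, hem_a, hem_b) triples paired by anchor rank."""
    sa = sorted(runs_a)  # order statistics: sort each run set on the anchor
    sb = sorted(runs_b)
    merged = []
    for (anc_a, hem_a), (anc_b, hem_b) in zip(sa, sb):
        # Readings with the same anchor rank are assumed to come from
        # comparably noisy runs; their anchor values are averaged here.
        merged.append(((anc_a + anc_b) / 2, hem_a, hem_b))
    return merged

# Toy usage: runs_a read (cycles, HEM_A), runs_b read (cycles, HEM_B).
runs_a = [(100, 10), (120, 14), (110, 12)]
runs_b = [(118, 7), (102, 5), (111, 6)]
print(merge_by_anchor(runs_a, runs_b))
# Pairs the lowest-cycle run of each set, then the next, and so on.
```

Pairing by rank rather than by run index is what keeps each HEM's relationship with the anchor intact even when the absolute counts drift between runs due to noise.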
I. INTRODUCTION

The complexity of processors in critical embedded systems (CES) continues to increase, with academic and industrial efforts devoted to analyzing the use of multicores as the baseline computing solution in future CES. Multicores – and multiprocessor system-on-chip (MPSoC) solutions in general – meet the increasing computing performance needs in CES domains like automotive [1] and avionics [2]. This is, in turn, motivated by the increasing computing requirements in autonomous CES that manage huge amounts of data, e.g. coming from radar, lidar, and cameras, and that implement compute-intensive AI algorithms [3]. The other side of the coin is that multicores complicate software timing analysis, due to the inherent complexity of cutting-edge hardware functionalities and the difficulty of capturing the contention in the access to hardware shared resources, which causes tasks to affect each other's timing behavior.

Consolidated timing analysis approaches are challenged by the inherent complexity of multicore computing solutions [4], [5] that are increasingly adopted in the CES domain [1], [2]. The complexity of analyzing such platforms principally emanates from the implications of multicore execution on the increasingly rich functionalities that CES are required to provide. This has led to significant interest in providing industrially-amenable solutions to master contention and the entailed multicore interference. Preventing or controlling contention between concurrently-running tasks has been considered a promising direction, with some approaches building on full segregation of accesses to the different blocks of memory-like resources [6], [7], including (i) the banks of shared on-chip caches and (ii) the banks/ranks in a DDR memory system [8], [9], [10], with solutions combining (i) and (ii) [11].
Other works propose changes to the application to precisely split its execution into memory and computation phases, so as to facilitate explicit scheduling of task phases in a way that avoids contention [12], [13]. These approaches, while being embraced in industrial-quality solutions [14], are not always applicable in practice, due to hardware characteristics or constraints on the applications' semantics. In all these cases, interference can still arise in shared buses or in shared buffers, tables, and queues in the cache [15], and altering applications' semantics is often not an option due to verification and validation (V&V) costs. In the NXP T2080, considered for adoption by the avionics industry [16], the number of shared components where interference can arise is overwhelming. Just in the L2 cache, we find the back invalidate buffer, reload table, reload fold queue, castout buffer, write data buffer, reload data buffer, and the snoop queue.

In any case, regardless of the specific scenario, an analysis approach is required to provide evidence that contention is actually avoided or mitigated (i.e. its impact can be bounded). Advanced measurement-based timing analysis approaches, building on a variably complex combination of software and hardware profiling [17], [18], are being considered as a promising analysis solution for functionally-rich and complex multicore platforms. Measurement-based approaches appear particularly appealing from an industrial (V&V) standpoint [19]. In this view, the Performance Monitoring Unit (PMU) provides the necessary entry point for retrieving the information required by the analysis. In fact, PMUs are becoming instrumental for software timing budgeting and V&V. As a first example, it has been shown that existing signals in the AMBA AHB bus provide the required information