MUCH: Exploiting Pairwise Hardware Event Monitor Correlations for Improved Timing Analysis of Complex MPSoCs

Sergi Vilardell, Isabel Serra, E. Mezzetti, Jaume Abella, Francisco J. Cazorla

Universitat Politècnica de Catalunya (UPC), Spain
Barcelona Supercomputing Center (BSC), Spain
Centre de Recerca Matemàtica (CRM), Spain

Abstract—Measurement-based timing analysis techniques increasingly rely on the Performance Monitoring Units (PMUs) of MPSoCs, as these units implement specialized Hardware Event Monitors (HEMs) that convey detailed information about multicore interference in shared hardware resources. Unfortunately, there is an evident mismatch between the large number of HEMs (typically several hundred) and the comparatively small number (normally fewer than ten) of Performance Monitoring Counters (PMCs) that can be configured to track HEMs in the PMU. Timing analysis normally requires observing a non-negligible number of HEMs per task from the same execution. However, due to the small number of PMCs, HEMs are necessarily collected across multiple runs that, despite being intended to repeat the same experiment, exhibit significant variability (above 50% for some HEMs in relevant MPSoCs) caused by platform-intrinsic execution conditions. Therefore, blindly merging HEMs from different runs is not acceptable, since they may easily correspond to significantly different conditions. To tackle this issue, the HRM approach has recently been proposed to merge HEMs from different runs while accurately preserving their correlation w.r.t. one anchor HEM (i.e. processor cycles), building on order statistics. However, HRM does not always preserve the correlation between other pairs of HEMs, which may be lost to a large extent. This paper addresses HRM's limitations by proposing the MUlti-Correlation HEM reading and merging approach (MUCH).
MUCH builds on multivariate Gaussian distributions to merge HEMs from different runs while simultaneously preserving the pairwise correlation of every pair of HEMs. Our results on an NXP T2080 MPSoC used for avionics systems show that MUCH largely outperforms HRM for an identical number of input runs.

I. INTRODUCTION

The pervasive adoption of increasingly autonomous systems in domains such as automotive and avionics imposes the use of multiprocessor systems-on-chip (MPSoCs) to reach the required performance levels. However, while the use of MPSoCs increases in those domains [1], they also hinder software timing analysis, as timing bounds become inherently dependent on multicore interference in the access to shared hardware resources such as shared caches, memory controllers, buses, and on-chip interconnects [2], [3], [4], [5].

Several solutions mitigate multicore interference by exploiting time segregation [6], [7], [8] (e.g. by partitioning applications into memory and computing phases) or space segregation (e.g. making different tasks access different hardware blocks) [9], [10], [11], [12], [13], [14], [15], [16]. The former approach is potentially intrusive with respect to the application code and semantics, and thus is not applicable in many cases. The latter does not avoid multicore interference altogether, since interference can still occur in hardware resources not visible at the software level, such as interconnects, as well as buffers and internal queues to shared caches [17], [18].

Powerful measurement-based timing analysis solutions building on hardware and software profiling have been proposed for multicores running critical applications [19], [20], [21]. In this context, the Performance Monitoring Unit (PMU) in MPSoCs offers information relevant for timing analysis [22], enabling verification and validation (V&V) and software time budgeting of time-critical applications.
For instance, the usage of some shared resources can be budgeted, monitored, and enforced using event quotas building on Hardware Event Monitors (HEMs) accessed through the PMU [23], [24], [25], [26]. HEMs are used to detect when tasks exceed their usage quotas, in which case the offending tasks are suspended. Indeed, HEM information is already used as a pillar to certify critical avionics systems [27], so PMUs, and the HEMs they allow monitoring, become the basis of industrial-quality multicore interference mitigation and estimation techniques.

PMUs typically support hundreds of HEMs, often related to access counts for different types of accesses and for different shared hardware resources. However, HEMs can only be read through a much smaller number (typically fewer than 10) of user-visible performance monitoring counters (PMCs). Therefore, when the number of HEMs relevant for timing analysis is higher than the number of PMCs, HEMs need to be read across multiple experiments. For instance, the NXP T2080 MPSoC has around 20 HEMs just for monitoring the shared L2 cache and only 6 PMCs.

The final publication is available at ACM via http://dx.doi.org/10.1145/3412841.3441931

This limitation conflicts with the fact that task scheduling needs to budget how much each task is expected to access shared resources, which can only be done, in measurement-based approaches, by reading a large number of HEMs simultaneously or, at least, consistently. However, collecting HEMs over several runs has some notable side effects on how the gathered values can be consistently merged: even if the very same experiment is intended to be repeated, the