Tuning Performance via Metrics with Expectations

Ahmad Yasin, Avi Mendelson, and Yosi Ben-Asher

Abstract—Modern server systems employ many features that are difficult for software developers to exploit. This paper calls for a new performance optimization approach that uses designated metrics with expected optimal values. A key insight is that expected values of these metrics are essential in order to verify that no performance is wasted during incremental utilization of processor features. We define sample primary metrics for modern architectures and present three distinct techniques that help to determine their optimal values. Our preliminary results successfully provide 2x-4x extra speedup during tuning of commonly used software optimizations on the matrix-multiply kernel. Additionally, our approach helped to identify counter-intuitive causes that hurt the multicore scalability of an optimized deep-learning benchmark on a Cascade Lake server.

Index Terms—Code tuning, measurements, micro-architecture, multi-core/single-chip multiprocessors, optimization, performance analysis, SIMD processors

1 INTRODUCTION

Motivation. Mainstream servers employ multiple features in order to achieve high performance: multicore, multi-level caches, powerful vector units, and complicated microarchitectures. On one hand, all of these features must be utilized to reach peak performance [1]. On the other hand, it is hard for developers to utilize them efficiently. Thus, techniques that identify potential optimizations and guide their tuning are valuable to the software community.

Challenges. High-performance processors include many features that involve tradeoffs. For example, prefetching can increase efficiency by improving the utilization of the memory subsystem, unless the memory bus is overloaded or the workload is not memory-bound. Unfortunately, the average software developer lacks the experience and the knowledge of the internals involved in these tradeoffs.
To aid in that, optimizations are included in compilers and runtime libraries, many of which come with parameters that require tuning. In modern architectures, the situation can be even more confusing. For example, activating multiple cores or utilizing vector units may inversely impact the frequency of a Skylake server [2], [3]. Lastly, techniques that abstract the details of the underlying hardware are necessary, given the limited effort that can be allocated for performance optimizations [4].

The main idea of this paper is that designated metrics with predefined optimal values are critical for performance optimizations. We define a sample of such metrics and some techniques to determine accurate optimal values for them. Examples of the metrics we use include the number of cache misses per floating point (FP) operation in a cache line, and the time spent in operating system (OS) services.

Evaluation. We examine many optimizations for a key HPC kernel—matrix-matrix multiplication (mmm). We show that a tuned implementation of each optimization achieves significant speedup. We demonstrate this on three commonly used optimizations: tiling (cache blocking), parallelization, and vectorization. We illustrate the idea with the well-known Roofline model [1], which plots performance as a function of Operational Intensity (OI, the number of FP operations per byte of memory traffic) under diagonal and flat roofs, as shown in Fig. 1. The vertical dashed lines represent optimal values for the OI metric as determined by our method. The square points correspond to a textbook version of mmm, where the optimal OI—0.25 operations-per-byte for reading from external memory (DDR) [5]—is not reached unless Loop Interchange is applied. The diamond (Tiling) points correspond to an L2 cache-blocked version of mmm, where again the optimal OI is not reached without careful tuning.
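The OI check above can be sketched in a few lines. This is a minimal illustration with helper names of our own (not from the paper): it computes OI from FLOP and traffic counts and flags whether a measured value reaches the expected optimum, here the 0.25 op-per-byte DDR target cited in the text.

```python
def operational_intensity(flops, ddr_bytes):
    """Operational Intensity: FP operations per byte of DDR traffic."""
    return flops / ddr_bytes

def reaches_expectation(measured_oi, optimal_oi, tol=0.05):
    """True when the measured OI is within tol of the expected optimum."""
    return measured_oi >= optimal_oi * (1.0 - tol)

# 2 FLOPs (a multiply and an add) per 8-byte double streamed from DDR
# gives the 0.25 op-per-byte optimum mentioned above.
optimal = operational_intensity(2, 8)       # 0.25
print(reaches_expectation(0.25, optimal))   # tuned version: expectation met
print(reaches_expectation(0.10, optimal))   # untuned version: performance wasted
```

A real tuning flow would obtain `flops` and `ddr_bytes` from hardware performance counters rather than hand counts.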
In addition, we demonstrate how our method's optimal value for multicore performance can help to distill the cost of software-originated non-scalability when analyzing a full machine-learning (ML) application. Furthermore, we pinpoint unexpected causes of that non-scalability, including work imbalance and noise induced by background system activities. In all cases, significant performance would be wasted unless the expected metric value is reached.

This paper makes the following contributions:
- A method to guide tuning of performance optimizations using designated metrics with expectations.
- Three innovative techniques to determine accurate optimal values for different types of performance metrics.
- Two evaluations of the method: tuning of popular software optimizations for a key kernel, as well as an insightful analysis of a real ML workload.

2 METHOD

This section generalizes the key principles that can help programmers set accurate expectations (optimal values) for a representative set of performance metrics. The following section demonstrates the principles for mmm and an ML application (to make the ideas easier to follow).

We use the notation $b_x(a)$ to represent some metric $b$ for a workload $a$ under the parameter/granularity $x$. $\hat{b}$ denotes the optimal value of $b$; for example, it is reported that $\widehat{OI}(mmm) = N/12$ when a square matrix of size $N$ fits in cache [5]. We illustrate the method through metrics of representative domains of a server system: application runtime, the memory subsystem, multicore, and in-core (utilization of the processor core engine). A key insight is that normalizing these metrics per unit of work helps to determine the optimal value. We initially discuss HPC-like FP kernels to simplify the presentation.

Memory Metrics. We use both $OI_{Lx}$ and estimated cache $Misses_{Lx}$.
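The pairing of a measured metric $b_x(a)$ with its expectation $\hat{b}$ can be made concrete with a small sketch. The structuring and names below are ours; the in-cache expectation $\widehat{OI}(mmm) = N/12$ is taken from the text (citing [5]).

```python
def expected_oi_mmm(n):
    """Expected optimal OI for square mmm of size N when the matrices
    fit in cache: OI-hat(mmm) = N/12, as reported in [5]."""
    return n / 12.0

def tuning_gap(measured, expected):
    """Fraction of the expected optimal value not yet reached,
    i.e., performance left on the table during tuning."""
    return max(0.0, 1.0 - measured / expected)

n = 120
print(expected_oi_mmm(n))                    # 10.0 ops per byte
print(tuning_gap(5.0, expected_oi_mmm(n)))   # 0.5: half the potential wasted
print(tuning_gap(10.0, expected_oi_mmm(n)))  # 0.0: expectation met
```

The point of the gap value is that it is actionable: a nonzero gap tells the programmer that further tuning of the current optimization is worthwhile before moving to the next one.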
The latter can be calculated for typical HPC kernels: if the dataset exceeds the cache, we claim it is sufficient to consider the hottest stream, and we present a technique to find it. $\widehat{Misses}_{Lx}(a)$ can be calculated as the total FLOP count (known) divided by the FP operations per single cache line filled into cache level $x$ ($Lx$) as a result of the hottest stream—see equation #Ia in Table 1. If tiling is used, only compulsory misses [6] need to be considered (equation #Ib).

Technique #1: determining the hot stream—Assign weights to loops such that a loop has 10x the weight of its outer loop. The hotness of a stream (e.g., array accesses) is inferred by multiplying the weights of the loop indices it references. See Listing #1 for a simplified example.

Multicore Metric. The idea is to predict multicore performance ($\widehat{Score}_n$) based on the performance of a small number of cores ($Score_s$) when considering platform-specific scaling, as defined by

A. Yasin is with Intel Corporation and also with the University of Haifa, Haifa 3498838, Israel. E-mail: ahmad.yasin@intel.com.
Y. Ben-Asher is with the University of Haifa, Haifa 3498838, Israel. E-mail: yosi@cs.haifa.ac.il.
A. Mendelson is with the Technion, Haifa 3200003, Israel. E-mail: avi.mendelson@tce.technion.ac.il.
Manuscript received 18 Jan. 2019; revised 30 Mar. 2019; accepted 22 Apr. 2019. Date of publication 13 May 2019; date of current version 27 June 2019. (Corresponding author: Ahmad Yasin.)
Digital Object Identifier no. 10.1109/LCA.2019.2916408
IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 18, NO. 2, JULY-DECEMBER 2019
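Technique #1 can be sketched directly from its description: weight each loop 10x its enclosing loop and take the product of the weights of the indices each stream references. The driver example and names below are ours, not the paper's Listing #1, and the final helper mirrors the division stated for equation #Ia.

```python
def loop_weights(loop_order):
    """Map loop index names (outermost first) to weights 1, 10, 100, ...,
    so each loop has 10x the weight of its outer loop."""
    return {name: 10 ** depth for depth, name in enumerate(loop_order)}

def stream_hotness(indices_used, weights):
    """Hotness of an array stream: product of its referenced index weights."""
    h = 1
    for idx in indices_used:
        h *= weights[idx]
    return h

# Textbook mmm: for i: for j: for k: C[i][j] += A[i][k] * B[k][j]
w = loop_weights(["i", "j", "k"])                # i -> 1, j -> 10, k -> 100
streams = {"C": ["i", "j"], "A": ["i", "k"], "B": ["k", "j"]}
hotness = {s: stream_hotness(idx, w) for s, idx in streams.items()}
hottest = max(hotness, key=hotness.get)
print(hottest)  # 'B' (100 * 10 = 1000) is the hottest stream

# With the hottest stream known, the miss estimate divides total FLOPs by
# the FP operations served per cache line filled by that stream.
def est_misses(total_flops, flops_per_line_of_hottest_stream):
    return total_flops / flops_per_line_of_hottest_stream
```

For the textbook loop order, the column-strided B stream dominates, which matches the intuition that it is the access pattern Loop Interchange and tiling aim to fix.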