Robust Ranking of Linear Algebra Algorithms via Relative Performance

Aravind Sankaran
AICES, RWTH Aachen University
Aachen, Germany
aravind.sankaran@rwth-aachen.de

Paolo Bientinesi
Department of Computer Science, Umeå Universitet
Umeå, Sweden
pauldj@cs.umu.se

Abstract—For a given linear algebra problem, we consider those solution algorithms that are mathematically equivalent to one another, and that mostly consist of a sequence of calls to kernels from optimized libraries such as BLAS and LAPACK. Although equivalent (at least in exact arithmetic), those algorithms typically exhibit significant differences in terms of performance, and naturally, we are interested in finding the fastest one(s). In practice, we often observe that multiple algorithms yield comparable performance characteristics. Therefore, we aim to identify the subset of algorithms that are reliably faster than the rest. To this end, instead of quantifying the performance of an algorithm in absolute terms, we present a measurement-based approach that assigns a relative score to the algorithms in comparison to one another. The relative performance is encoded by sorting the algorithms based on pair-wise comparisons and ranking them into equivalence classes, where more than one algorithm can obtain the same rank. We show that the relative performance leads to robust identification of the fastest algorithms, that is, reliable identification even under noisy system conditions.

Index Terms—performance analysis, performance modelling, benchmarking, sampling

I. INTRODUCTION

Given a set A of mathematically equivalent linear algebra algorithms, we aim to identify the subset F ⊆ A containing all those algorithms that are "equivalently" fast to one another, and "noticeably" faster than the algorithms in the subset A\F.
We will clarify the meaning of "equivalent" and "noticeable" shortly; for now, we simply state that in order to identify F, we develop an approach that assigns a higher score to the algorithms in F compared to those in A\F. Instead of aiming for a value that captures the quality of an algorithm in absolute terms, we seek to compute a relative score that compares each algorithm with respect to the fastest algorithm(s) in A. We refer to such scores as "relative performance estimates".

It is well known that execution times are influenced by many factors, and that repeated measurements, even with the same input data and the same cache conditions, often result in different execution times [1]–[3]. Therefore, finding the best algorithm is a task that involves comparing distributions of execution times. In common practice, time measurements are summarized into statistical estimates (such as the minimum or median execution time), which are then used to compare algorithms [4]. But when system noise begins to have a significant impact on program execution, it becomes difficult to summarize the performance into a single number; as a consequence, the comparisons are not consistent when the time measurements are repeated, and this leads to inconsistency in the ranking of algorithms. In order for one algorithm to be better (or worse) than another, there should be a "noticeable" difference in their distributions (Figure 1a). On the other hand, the performance of two algorithms is comparable if their distributions are "equivalent" or have significant overlap (Figure 1b). Therefore, the result of comparing two algorithms can fall into one of three categories: better, worse, or equivalent.

Financial support from the Deutsche Forschungsgemeinschaft (German Research Foundation) through grants GSC 111 and IRTG 2379 is gratefully acknowledged.
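To make the three-way comparison concrete, the following is a minimal NumPy sketch. It is only an illustration, not the paper's exact definition: here, two samples of execution times are declared "equivalent" when the fraction of pairwise wins of one algorithm over the other stays within a margin around 0.5 (the margin value is an arbitrary choice).

```python
import numpy as np

def three_way_compare(times_a, times_b, margin=0.1):
    """Compare two execution-time samples; return 'better', 'worse',
    or 'equivalent' (from the point of view of algorithm A).

    Illustrative criterion: the fraction of sample pairs (i, j) in
    which A's i-th run beats B's j-th run; ties count as half a win.
    """
    a = np.asarray(times_a, dtype=float)
    b = np.asarray(times_b, dtype=float)
    # Broadcast to all pairs of measurements.
    wins = np.mean((a[:, None] < b[None, :]) + 0.5 * (a[:, None] == b[None, :]))
    if wins > 0.5 + margin:
        return "better"
    if wins < 0.5 - margin:
        return "worse"
    return "equivalent"
```

With clearly separated samples the verdict is stable; with heavily overlapping samples the function refuses to pick a winner, which is exactly the behavior that makes a ranking built on such comparisons robust to noise.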
In this paper, we use this three-way comparison to cluster the set of algorithms A into performance classes and construct a ranking that is consistent (or robust) despite noisy system conditions.

The algorithms in A represent different, alternative ways of computing the same mathematical expression. In exact arithmetic, those algorithms would all return the same quantity. For instance, in the expression y_k := H^T y + (I_n - H^T H) x_k, which appears in an image restoration application [5], if the product H^T H is computed explicitly, the code performs an O(n^3) matrix-matrix multiplication; by contrast, by applying distributivity,¹ one can rewrite this assignment as y_k := H^T (y - H x_k) + x_k, obtaining an alternative algorithm which computes the same expression by using only matrix-vector multiplications, for a cost of O(n^2). In this example, the two algorithms differ in the order of magnitude of floating point operations (FLOPs), hence noticeable differences in terms of execution times are naturally expected. However, two algorithms may differ significantly in execution time even if they perform the same number of FLOPs, and it is even possible that a higher FLOP count results in a faster execution [6]. In practice, one has to consider and compare more than just two algorithms. For instance, for the generalized least squares problem y := (X^T S^{-1} X)^{-1} X^T S^{-1} z, it is possible to find more than 100 different algorithms that compute the solution.

¹ In general, distributivity does not always lead to a lower FLOP count.

arXiv:2010.07226v1 [cs.PF] 14 Oct 2020
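The two rewrites of the image-restoration assignment can be checked directly. The sketch below (matrix size and random data are arbitrary choices for illustration) implements both variants with NumPy and verifies that they agree numerically, while differing in the kind of BLAS kernels they map to.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
H = rng.standard_normal((n, n))
x = rng.standard_normal(n)   # plays the role of x_k
y = rng.standard_normal(n)

# Variant 1: forms H^T H explicitly -- an O(n^3) matrix-matrix product (gemm-like).
y1 = H.T @ y + (np.eye(n) - H.T @ H) @ x

# Variant 2: applies distributivity -- only O(n^2) matrix-vector products (gemv-like).
y2 = H.T @ (y - H @ x) + x

# Mathematically equivalent, so the results agree up to rounding error.
assert np.allclose(y1, y2)
```

As the paper notes, the asymptotic FLOP gap here makes a noticeable execution-time difference expected; deciding between algorithms with comparable FLOP counts is where the distribution-based comparison becomes necessary.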