Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

SIAM J. SCI. COMPUT.                          © 2008 Society for Industrial and Applied Mathematics
Vol. 31, No. 2, pp. 1156–1174

REDUCING FLOATING POINT ERROR IN DOT PRODUCT USING THE SUPERBLOCK FAMILY OF ALGORITHMS

ANTHONY M. CASTALDO, R. CLINT WHALEY, AND ANTHONY T. CHRONOPOULOS

Abstract. This paper discusses both the theoretical and statistical errors obtained by various well-known dot products, from the canonical to pairwise algorithms, and introduces a new and more general framework that we have named superblock, which subsumes them and permits a practitioner to make trade-offs between computational performance, memory usage, and error behavior. We show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike many such error-reducing algorithms, superblock requires no additional floating point operations and should be implementable with little to no performance loss, making it suitable for use as a performance-critical building block of a linear algebra kernel.

Key words. dot product, inner product, error analysis, BLAS, ATLAS

AMS subject classifications. 65G50, 65K05, 65K10, 65Y20, 68-04

DOI. 10.1137/070679946

1. Introduction. A host of linear algebra methods derive their error behavior directly from dot product. In particular, most high performance dense systems derive their performance and error behavior overwhelmingly from matrix multiply, and matrix multiply's error behavior is almost wholly attributable to the underlying dot product that it is built from (sparse problems usually have a similar relationship with matrix-vector multiply, which can also be built from dot product). With the expansion of standard workstations to 64-bit memories and multicore processors, much larger calculations are possible on even simple desktop machines than ever before. Parallel machines built from these hugely expanded nodes can solve problems of almost unlimited size.
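To make the relationship between matrix multiply and dot product concrete, here is a minimal illustrative sketch (not code from the paper; the function names are ours) showing each entry of C = A·B computed as one dot product of a row of A with a column of B:

```python
def dot(x, y):
    """Canonical dot product: a single left-to-right accumulation."""
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def gemm(A, B):
    """Naive GEMM: A is n-by-k, B is k-by-m, result is n-by-m.
    Each output entry C[i][j] is the dot product of row i of A
    with column j of B, so GEMM inherits dot product's error behavior."""
    k = len(B)
    m = len(B[0])
    return [[dot(row, [B[p][j] for p in range(k)]) for j in range(m)]
            for row in A]
```

A high performance GEMM reorders and blocks these same dot products for cache and register reuse, but the accumulation pattern, and hence the rounding error, is still that of the underlying dot product algorithm.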
The canonical dot product has a worst-case error bound that rises linearly with vector length. In the past this has not been deemed intolerable, but with problem sizes increasing it becomes important to examine the assumption that a linear rise in worst-case error is tolerable and to examine whether we can moderate it without a noticeable loss in performance.

Dot product is an important operation in its own right, but due to performance considerations linear algebra implementations only rarely call it directly. Instead, most large-scale linear algebra operations call matrix multiply (aka GEMM, for general matrix multiply) [1, 3], which can be made to run very near the theoretical peak of the architecture. High performance matrix multiply can in turn be implemented as a series of parallel dot products, and this is the case in our own ATLAS [31, 30] project, which uses GEMM as the building block of its high performance BLAS [15, 22, 10, 11, 9] implementation. Therefore, we are keenly interested in both the error bound of a given dot product algorithm and whether that algorithm is likely to allow for a high performance GEMM implementation. The implementation and performance of GEMM are not the focus of this paper, but we review them for

Received by the editors January 11, 2007; accepted for publication (in revised form) July 25, 2008; published electronically December 17, 2008. This work was supported in part by National Science Foundation CRI grant CNS-0551504.
http://www.siam.org/journals/sisc/31-2/67994.html
Department of Computer Science, University of Texas at San Antonio, 6900 N. Loop 1604 West, San Antonio, TX 78249 (castaldo@cs.utsa.edu, whaley@cs.utsa.edu, atc@cs.utsa.edu).
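The dot product variants discussed above can be sketched as follows. These are illustrative implementations under our own naming, not the authors' code; in particular, `dot_blocked` and its block size `nb` are our simplified stand-in for the kind of canonical-within-blocks, pairwise-across-blocks trade-off that the superblock framework generalizes:

```python
def dot_canonical(x, y):
    """Canonical dot product: one running sum, left to right.
    Worst-case rounding error grows linearly with the vector length."""
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def dot_pairwise(x, y):
    """Pairwise dot product: recursively halve and add the halves' results.
    Worst-case rounding error grows only logarithmically with length."""
    n = len(x)
    if n == 1:
        return x[0] * y[0]
    h = n // 2
    return dot_pairwise(x[:h], y[:h]) + dot_pairwise(x[h:], y[h:])

def dot_blocked(x, y, nb=4):
    """Two-level scheme: canonical accumulation inside blocks of size nb
    (fast, register-friendly), pairwise combination across block sums
    (bounds error growth). nb is a tunable hypothetical parameter."""
    sums = [dot_canonical(x[i:i + nb], y[i:i + nb])
            for i in range(0, len(x), nb)]
    while len(sums) > 1:
        sums = [sums[i] + sums[i + 1] if i + 1 < len(sums) else sums[i]
                for i in range(0, len(sums), 2)]
    return sums[0]
```

All three perform exactly the same number of floating point operations; they differ only in the order of accumulation, which is why a scheme like the blocked one can reduce error without extra arithmetic cost.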