SIAM J. SCI. COMPUT. © 2008 Society for Industrial and Applied Mathematics
Vol. 31, No. 2, pp. 1156–1174
REDUCING FLOATING POINT ERROR IN DOT PRODUCT USING
THE SUPERBLOCK FAMILY OF ALGORITHMS∗

ANTHONY M. CASTALDO†, R. CLINT WHALEY†, AND ANTHONY T. CHRONOPOULOS†
Abstract. This paper discusses both the theoretical and statistical errors obtained by various
well-known dot products, from the canonical to pairwise algorithms, and introduces a new and more
general framework that we have named superblock, which subsumes them and permits a
practitioner to make trade-offs among computational performance, memory usage, and
error behavior. We
show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike
many such error-reducing algorithms, superblock requires no additional floating point operations
and should be implementable with little to no performance loss, making it suitable for use as a
performance-critical building block of a linear algebra kernel.
Key words. dot product, inner product, error analysis, BLAS, ATLAS
AMS subject classifications. 65G50, 65K05, 65K10, 65Y20, 68-04
DOI. 10.1137/070679946
1. Introduction. A host of linear algebra methods derive their error behav-
ior directly from dot product. In particular, most high performance dense systems
derive their performance and error behavior overwhelmingly from matrix multiply,
and matrix multiply’s error behavior is almost wholly attributable to the underlying
dot product that it is built from (sparse problems usually have a similar relationship
with matrix-vector multiply, which can also be built from dot product). With the ex-
pansion of standard workstations to 64-bit memories and multicore processors, much
larger calculations are possible on even simple desktop machines than ever before.
Parallel machines built from these hugely expanded nodes can solve problems of al-
most unlimited size. The canonical dot product has a worst-case error bound that
rises linearly with vector length. In the past this has been considered tolerable, but
with problem sizes increasing it becomes important to reexamine that assumption and
to ask whether we can moderate the worst-case error growth without a noticeable loss
in performance.
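To make the contrast concrete, here is a minimal Python sketch (not the superblock algorithm introduced in this paper) of the two endpoints just mentioned: the canonical left-to-right dot product, whose worst-case error bound grows as O(n), and the well-known pairwise (recursive halving) scheme, whose bound grows as O(log n):

```python
import math

def canonical_dot(x, y):
    """Canonical left-to-right accumulation; worst-case error grows as O(n)."""
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def pairwise_dot(x, y):
    """Pairwise (recursive halving) summation; worst-case error grows as O(log n)."""
    n = len(x)
    if n <= 2:
        return canonical_dot(x, y)
    m = n // 2
    return pairwise_dot(x[:m], y[:m]) + pairwise_dot(x[m:], y[m:])

# Summing 10**6 copies of 0.1 (exact value 100000) exposes the difference;
# math.fsum serves as a correctly rounded reference.
x = [0.1] * 10**6
y = [1.0] * 10**6
ref = math.fsum(a * b for a, b in zip(x, y))
err_canonical = abs(canonical_dot(x, y) - ref)
err_pairwise = abs(pairwise_dot(x, y) - ref)
```

On inputs like this the pairwise error is typically orders of magnitude smaller than the canonical one, even though both variants perform exactly the same number of floating point operations; the superblock framework described below subsumes both as special cases.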
Dot product is an important operation in its own right, but due to performance
considerations linear algebra implementations only rarely call it directly. Instead,
most large-scale linear algebra operations call matrix multiply (aka GEMM, for gen-
eral matrix multiply) [1, 3], which can be made to run very near the theoretical
peak of the architecture. High performance matrix multiply can in turn be imple-
mented as a series of parallel dot products, and this is the case in our own AT-
LAS [31, 30] project, which uses GEMM as the building block of its high performance
BLAS [15, 22, 10, 11, 9] implementation. Therefore, we are keenly interested in both
the error bound of a given dot product algorithm and whether that algorithm is
likely to allow for a high performance GEMM implementation. The implementation
and performance of GEMM are not the focus of this paper, but we review them for
∗Received by the editors January 11, 2007; accepted for publication (in revised form) July 25,
2008; published electronically December 17, 2008. This work was supported in part by National
Science Foundation CRI grant CNS-0551504.
http://www.siam.org/journals/sisc/31-2/67994.html
†Department of Computer Science, University of Texas at San Antonio, 6900 N. Loop 1604 West,
San Antonio, TX 78249 (castaldo@cs.utsa.edu, whaley@cs.utsa.edu, atc@cs.utsa.edu).