Parallel Computing 81 (2019) 22–31
Journal homepage: www.elsevier.com/locate/parco

Dynamic look-ahead in the reduction to band form for the singular value decomposition

Andrés E. Tomás (a), Rafael Rodríguez-Sánchez (b,∗), Sandra Catalán (a), Rocío Carratalá-Sáez (a), Enrique S. Quintana-Ortí (a)

(a) Dept. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain
(b) Dep. Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain

Article info

Article history: Received 24 May 2018; Revised 11 September 2018; Accepted 28 November 2018; Available online 29 November 2018

Keywords: Singular value decomposition; Two-stage reduction; Look-ahead; Runtime; Dynamic scheduling; Multicore processors

Abstract

We investigate the introduction of look-ahead in two-stage algorithms for the singular value decomposition (SVD). Our approach relies on a specialized reduction for the first stage that produces a band matrix with the same upper and lower bandwidth instead of the conventional upper triangular-band matrix. In the case of a CPU-GPU server, this alternative form accommodates a static look-ahead into the algorithm in order to overlap the reduction of the “next” panel on the CPU and the “current” trailing update on the GPU. For multicore processors, we leverage the same compact form to formulate a version of the algorithm that advances the reduction of “future” panels, yielding a dynamic look-ahead that overcomes the performance bottleneck that the sequential panel factorization represents.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

The Singular Value Decomposition (SVD) is a handy numerical tool for the computation of the matrix rank, the calculation of low-rank approximations to a matrix, and the solution of ill-conditioned least squares problems.
These linear algebra kernels arise in medicine, geosciences, material sciences, crystallography, security, and information retrieval, among others [1–3].

Given a dense matrix $A \in \mathbb{R}^{m \times n}$, the standard algorithm to compute its SVD first obtains a reduced bidiagonal matrix $B \in \mathbb{R}^{m \times n}$, using a sequence of Householder reflectors that are applied from the left- and right-hand sides of $A$ as:

$$A = U B V^T, \tag{1}$$

where $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ are both orthogonal and $B \in \mathbb{R}^{m \times n}$ is upper bidiagonal [4]. (Without loss of generality, hereafter we will assume that $m \geq n$. Otherwise, we can simply compute the SVD of $A^T$.) The cost of this direct two-sided reduction (TSR) algorithm is $4n^2(m - n/3)$ floating-point operations (flops). The singular values of $A$ are then computed from the bidiagonal matrix $B$, via the QD algorithm or some divide-and-conquer variant, adding a minor cost to the flop count (unless the singular vectors are also required) [4].

∗ Corresponding author.
E-mail addresses: tomasan@uji.es (A.E. Tomás), rafaelrs@ucm.es (R. Rodríguez-Sánchez), catalans@uji.es (S. Catalán), rcarrata@uji.es (R. Carratalá-Sáez), quintana@uji.es (E.S. Quintana-Ortí).
https://doi.org/10.1016/j.parco.2018.11.001
0167-8191/© 2018 Elsevier B.V. All rights reserved.

An algorithm for the direct reduction to bidiagonal form $A \to B$, via Householder reflectors, will necessarily perform a significant amount of flops in terms of the Level-2 BLAS (basic linear algebra subprograms) [6]. As a consequence, this