Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors

Pedro Alonso, Depto. de Sistemas Informáticos y Computación, Universitat Politècnica de València, Spain (palonso@upv.es)
Sandra Catalán, Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (catalans@uji.es)
José R. Herrero, Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Spain (josepr@ac.upc.edu)
Enrique S. Quintana-Ortí, Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (quintana@uji.es)
Rafael Rodríguez-Sánchez, Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (rarodrig@uji.es)

Abstract

Asymmetric multicore processors (AMPs), such as those based on ARM big.LITTLE technology, have been proposed as a means to address the end of Dennard scaling. The idea behind these architectures is to activate only the type (and number) of cores that satisfy the quality of service requested by the application(s) in execution, thereby delivering high energy efficiency. For dense linear algebra problems, though, performance is of paramount importance, which calls for an efficient use of all the computational resources in the AMP. In response, we investigate how to exploit the asymmetric cores of an ARMv7 big.LITTLE AMP to attain high performance for the reduction to tridiagonal form, an essential step towards the solution of dense symmetric eigenvalue problems. The LAPACK routine for this purpose is especially challenging, since half of its floating-point arithmetic operations (flops) are cast in terms of compute-bound kernels while the remaining half correspond to memory-bound kernels.
To deal with this scenario: 1) we leverage a tuned implementation of the compute-bound kernels for AMPs; 2) we develop and parallelize new architecture-aware micro-kernels for the memory-bound kernels; and 3) we carefully adjust the type and number of cores to use at each step of the reduction procedure.

Categories and Subject Descriptors: C.1.3 [Computer Systems Organization]: Other Architecture Styles—heterogeneous (hybrid) systems; C.4 [Performance of Systems]: Performance and energy efficiency; G.4 [Mathematical Software]: Efficiency

Keywords: Condensed forms, symmetric eigenvalue problem, basic linear algebra subprograms (BLAS), asymmetric multicore processors, multi-threading, workload balancing

PMAM'17, February 04-08, 2017, Austin, TX, USA
© 2017 ACM. ISBN 978-1-4503-4883-6/17/02...$15.00
DOI: http://dx.doi.org/10.1145/3026937.3026938

1. Introduction

We target the reduction of a dense matrix to tridiagonal form as an initial step for the solution of dense symmetric eigenvalue problems on multi-threaded asymmetric multicore processors (AMPs). Our interest in these architectures is motivated by the growing role of low-power multicore systems-on-chip (SoCs) on the road to exascale systems, and by the challenge of efficiently exploiting the heterogeneous types of cores in these architectures for dense linear algebra (DLA) operations.
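To fix ideas, the reduction targeted here can be illustrated with a minimal, unblocked NumPy sketch of Householder tridiagonalization. This is only a didactic model: the function name `tridiagonalize` is ours, and the actual LAPACK routine is blocked and organized around the compute- and memory-bound kernels discussed below.

```python
import numpy as np

def tridiagonalize(A):
    """Reduce a symmetric matrix to tridiagonal form via a sequence of
    Householder orthogonal similarity transformations (unblocked sketch,
    for illustration only; not the blocked LAPACK algorithm)."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k + 1:, k].copy()
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue
        # Householder vector v (unit norm) annihilating x below its first entry
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])
        v /= np.linalg.norm(v)
        # Zero out column k below the first subdiagonal, mirror into row k
        T[k + 1:, k] -= 2.0 * v * (v @ T[k + 1:, k])
        T[k, k + 1:] = T[k + 1:, k]
        # Two-sided update of the trailing block: S <- H S H, H = I - 2 v v^T
        S = T[k + 1:, k + 1:]            # view into T; updated in place
        w = S @ v
        alpha = v @ w
        S -= 2.0 * np.outer(v, w) + 2.0 * np.outer(w, v) \
             - 4.0 * alpha * np.outer(v, v)
    return T
```

Because each step applies an orthogonal similarity transformation, the resulting tridiagonal matrix T has the same eigenvalues as A, which is what allows the subsequent tridiagonal eigensolver to operate on T instead of A.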
Eigenvalue problems appear, among other areas, in computational quantum chemistry, finite element modeling, multivariate statistics, and density functional theory [10]. Efficient and numerically reliable algorithms for the solution of dense symmetric eigenvalue problems consist of three stages [8]. First, the input n × n matrix A defining the problem is reduced to a symmetric n × n tridiagonal matrix T via a sequence of orthogonal similarity transformations. Next, a tridiagonal eigensolver, such as the MR³ algorithm [2, 6], is applied to accurately compute the eigenvalues of T, which coincide with those of A. Finally, when the eigenvectors of A are also desired, a back-transform recovers them from the eigenvectors of the tridiagonal problem.

In this paper we propose several optimizations, specifically designed to deliver an efficient execution on an ARM big.LITTLE AMP, to the procedure that reduces a symmetric matrix A to tridiagonal form. The reason for addressing the first stage only is that this factorization costs O(n³) floating-point arithmetic operations (flops), while the second stage, when based on the MR³ algorithm, requires only O(n²) flops and therefore makes a minor contribution to the total cost and performance of the eigensolver. In addition, although the third stage also requires O(n³) flops, when