Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors

Pedro Alonso, Depto. de Sistemas Informáticos y Computación, Universitat Politècnica de València, Spain (palonso@upv.es)
Sandra Catalán, Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (catalans@uji.es)
José R. Herrero, Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Spain (josepr@ac.upc.edu)
Enrique S. Quintana-Ortí, Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (quintana@uji.es)
Rafael Rodríguez-Sánchez, Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain (rarodrig@uji.es)

Abstract

Asymmetric multicore processors (AMPs), such as those based on ARM big.LITTLE technology, have been proposed as a means to address the end of Dennard scaling. The idea behind these architectures is to activate only the type (and number) of cores that satisfy the quality of service requested by the application(s) in execution, thereby delivering high energy efficiency. For dense linear algebra problems, though, performance is of paramount importance, which calls for an efficient use of all the computational resources in the AMP. In response, we investigate how to exploit the asymmetric cores of an ARMv7 big.LITTLE AMP to attain high performance for the reduction to tridiagonal form, an essential step towards the solution of dense symmetric eigenvalue problems. The LAPACK routine for this purpose is especially challenging, since half of its floating-point arithmetic operations (flops) are cast in terms of compute-bound kernels while the remaining half correspond to memory-bound kernels.
To deal with this scenario: 1) we leverage a tuned implementation of the compute-bound kernels for AMPs; 2) we develop and parallelize new architecture-aware micro-kernels for the memory-bound kernels; and 3) we carefully adjust the type and number of cores to use at each step of the reduction procedure.

Categories and Subject Descriptors: C.1.3 [Computer Systems Organization]: Other Architecture Styles—heterogeneous (hybrid) systems; C.4 [Performance of Systems]: Performance and energy efficiency; G.4 [Mathematical Software]: Efficiency

Keywords: Condensed forms, symmetric eigenvalue problem, basic linear algebra subprograms (BLAS), asymmetric multicore processors, multi-threading, workload balancing

PMAM'17, February 04-08, 2017, Austin, TX, USA
© 2017 ACM. ISBN 978-1-4503-4883-6/17/02...$15.00
DOI: http://dx.doi.org/10.1145/3026937.3026938

1. Introduction

We target the reduction of a dense matrix to tridiagonal form as an initial step for the solution of dense symmetric eigenvalue problems on multi-threaded asymmetric multicore processors (AMPs). Our interest in these architectures is motivated by the growing role of low-power multicore systems-on-chip (SoCs) on the road to exascale systems, and by the challenge of efficiently exploiting the heterogeneous types of cores in these architectures for dense linear algebra (DLA) operations.
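To fix ideas, the reduction targeted here can be illustrated with a minimal, unblocked NumPy sketch of Householder tridiagonalization. This is only a didactic model: the function name `tridiagonalize` is ours, and the actual LAPACK routine is blocked and organized around the compute- and memory-bound kernels discussed below.

```python
import numpy as np

def tridiagonalize(A):
    """Reduce a symmetric matrix to tridiagonal form via a sequence of
    Householder orthogonal similarity transformations (unblocked sketch,
    for illustration only; not the blocked LAPACK algorithm)."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k + 1:, k].copy()
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue
        # Householder vector v (unit norm) annihilating x below its first entry
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])
        v /= np.linalg.norm(v)
        # Zero out column k below the first subdiagonal, mirror into row k
        T[k + 1:, k] -= 2.0 * v * (v @ T[k + 1:, k])
        T[k, k + 1:] = T[k + 1:, k]
        # Two-sided update of the trailing block: S <- H S H, H = I - 2 v v^T
        S = T[k + 1:, k + 1:]            # view into T; updated in place
        w = S @ v
        alpha = v @ w
        S -= 2.0 * np.outer(v, w) + 2.0 * np.outer(w, v) \
             - 4.0 * alpha * np.outer(v, v)
    return T
```

Because each step applies an orthogonal similarity transformation, the resulting tridiagonal matrix T has the same eigenvalues as A, which is what allows the subsequent tridiagonal eigensolver to operate on T instead of A.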
Eigenvalue problems appear, among other areas, in computational quantum chemistry, finite element modeling, multivariate statistics, and density functional theory [10]. Efficient and numerically reliable algorithms for the solution of dense symmetric eigenvalue problems consist of three stages [8]. First, the input n × n matrix A defining the problem is reduced to a symmetric n × n tridiagonal matrix T via a sequence of orthogonal similarity transformations. Next, a tridiagonal eigensolver, such as the MR³ algorithm [2, 6], is applied to accurately compute the eigenvalues of T, which coincide with those of A. Finally, when the eigenvectors of A are also desired, a back-transform recovers them from the eigenvectors of the tridiagonal problem.

In this paper we propose several optimizations, specifically designed to deliver an efficient execution on an ARM big.LITTLE AMP, to the procedure that reduces a symmetric matrix A to tridiagonal form. The reason for addressing the first stage only is that this factorization costs O(n³) floating-point arithmetic operations (flops), while the second stage, when based on the MR³ algorithm, requires only O(n²) flops and therefore makes a minor contribution to the total cost and performance of the eigensolver. In addition, although the third stage also requires O(n³) flops, when