Improving Performance and Energy Consumption of Runtime Schedulers for Dense Linear Algebra

FLAME Working Note #73

Pedro Alonso (1), Manuel F. Dolz (2), Francisco D. Igual (3), Rafael Mayo (4), and Enrique S. Quintana-Ortí (4)

1 Depto. de Sistemas Informáticos y Computación, Universitat Politècnica de València, 46.022 Valencia, Spain
2 Dept. of Informatics, University of Hamburg, 22.527 Hamburg, Germany
3 Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, 28.040 Madrid, Spain
4 Depto. de Ingeniería y Ciencia de Computadores, Universitat Jaume I, 12.071 Castellón, Spain

June 2, 2014

Abstract

The road towards Exascale Computing requires a holistic effort to address three different challenges simultaneously: high performance, energy efficiency, and programmability. The use of runtime task schedulers to orchestrate parallel executions with minimal developer intervention has been introduced in recent years to tackle the programmability issue while maintaining, or even improving, performance. In this paper, we enhance the SuperMatrix runtime task scheduler integrated in the libflame library in two different directions that address high performance and energy efficiency. First, we extend the runtime to accommodate hybrid parallel executions and manage task priorities for dense linear algebra operations, with remarkable performance improvements. Second, we introduce techniques to reduce energy consumption during the idle times inherent to parallel executions, attaining important energy savings. In addition, we propose a power consumption model that can be leveraged by runtime task schedulers to make decisions based not only on performance, but also on energy considerations.
1 Introduction

With the introduction of the CUDA [1] and OpenCL [2] programming standards, graphics processing units (GPUs) are being increasingly adopted for their affordable price, favorable energy-performance balance and, due to their vast amount of hardware concurrency, the excellent acceleration factors demonstrated for many compute-intensive applications with ample data-parallelism [3, 4]. Nevertheless, these hardware accelerators must be attached to a conventional (multicore) processor (or CPU), and efficiently programming a heterogeneous platform consisting of one or more multicore processors and multiple GPUs is still a considerable challenge. The reason is that, when dealing with these parallel (hybrid) systems, in addition to facing the programming difficulties intrinsic to concurrency, the developer has to cope with the existence of multiple memory address spaces and different programming models.

Author e-mails: palonso@dsic.upv.es, dolzm@icc.uji.es, figual@fdi.ucm.es, mayo@icc.uji.es, quintana@icc.uji.es
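As a generic illustration of the scheduling approach discussed above (not the actual SuperMatrix implementation, whose API differs), a runtime task scheduler tracks data dependencies among tasks and dispatches each task only once all of its inputs are ready; among the ready tasks, priorities decide which one is executed first. The following Python sketch, with entirely hypothetical names, captures that mechanism:

```python
import heapq

# Minimal sketch of dependency-driven task scheduling with priorities.
# A task becomes "ready" when all of its predecessors have completed;
# among ready tasks, the one with the lowest priority value runs first.
# All names here are illustrative, not part of any real runtime's API.

def schedule(tasks, deps, priority):
    """tasks: list of task names.
    deps: dict mapping a task to the set of tasks it depends on.
    priority: dict mapping a task to an int (lower = dispatched earlier).
    Returns the order in which tasks are executed."""
    pending = {t: set(deps.get(t, ())) for t in tasks}
    # Tasks with no unmet dependencies start in the ready queue.
    ready = [(priority[t], t) for t in tasks if not pending[t]]
    heapq.heapify(ready)
    order = []
    while ready:
        _, t = heapq.heappop(ready)   # dispatch highest-priority ready task
        order.append(t)               # "execute" the task
        for succ, reqs in pending.items():
            if t in reqs:             # release tasks that waited on t
                reqs.remove(t)
                if not reqs:
                    heapq.heappush(ready, (priority[succ], succ))
    return order

# Example: a diamond-shaped dependency graph A -> {B, C} -> D,
# where C carries a higher priority than B.
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
prio = {"A": 0, "B": 2, "C": 1, "D": 0}
print(schedule(["A", "B", "C", "D"], deps, prio))  # → ['A', 'C', 'B', 'D']
```

A real runtime such as SuperMatrix additionally maps ready tasks onto heterogeneous workers (CPU cores and GPUs) and manages the associated data transfers, but the dependency-release loop above is the conceptual core.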