Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications

Christoph A. Schaefer, Victor Pankratius, Walter F. Tichy
Institute for Program Structures and Data Organization (IPD)
University of Karlsruhe, D-76128 Karlsruhe, Germany

Abstract

Automatic performance tuning (auto-tuning) has been used in parallel numerical applications for adapting performance-relevant parameters. We extend auto-tuning to general-purpose parallel applications on multicores. This paper concentrates on Atune-IL, an instrumentation language for specifying a wide range of tunable parameters for a generic auto-tuner. Tunable parameters include the number of threads and other size parameters, but also the choice of algorithms, the number of pipeline stages, etc. We discuss a case study of Atune-IL’s usage in a real-world application with 13 parameters and over 24 million possible value combinations. With Atune-IL, the search space was reduced to 1,600 combinations, and the lines of code needed for instrumentation were reduced from more than 700 to 25.

1 Introduction

As multicore platforms become ubiquitous, many software applications have to be parallelized and tuned for performance. In the past, one could afford to optimize code by hand for certain parallel machines. In the multicore world, with its mass markets for parallel computers, manual tuning must be automated. The reasons are manifold: the user community has grown significantly, as has the diversity of application areas for parallelism. In addition, the available parallel platforms differ in many respects, e.g., in the number or type of cores, the number of simultaneously executing hardware threads, the cache architecture, the available memory, or the operating system employed. Thus, the number of targets to optimize for has exploded. Even worse, optimizations made for one machine may cause a slowdown on another.

At the same time, multicore software has to remain portable and easy to maintain, which means that hard-wired code optimizations must be avoided. Libraries with already tuned code bring only small improvements, as the focus of optimization is often narrowed down to specific problems or algorithms [11]. Moreover, libraries are highly platform-specific and require interfaces to be agreed upon. To achieve good overall performance, there seems to be no way around adapting the whole software architecture of a parallel program to the target architecture.

Automatic performance tuning (auto-tuning) [5], [10], [19] is a promising systematic approach in which parallel programs are written in a generic and portable way, while their performance remains comparable to that of manual optimization.

In this paper, we focus on the problem of how to connect an auto-tuner to a parallel application. We introduce Atune-IL, a general instrumentation language that is used throughout the development of a parallel program to define tunable parameters. Our tuning instrumentation language is based on language-independent #pragma annotations that are inserted into the code of an existing parallel application. Atune-IL has powerful features that go far beyond related work in numerics [5], [19], [14]. Our approach aims to improve the software engineering of general-purpose parallel applications; it provides constructs to specify tunable variables, add meta-information on nested parallelism (to allow optimization on several abstraction layers), and vary the program architecture. All presented features are fully functional and have been positively evaluated in the context of a large commercial application analyzing biological data on an eight-core machine. With our approach, we were able to reduce the code size required for instrumentation by 96%, and the auto-tuner’s search space by 99%.

The paper is organized as follows.
Section 2 provides essential background knowledge on auto-tuning general-purpose parallel applications. Section 3 introduces Atune-IL, our tuning instrumentation language. Section 4 shows how program variants are generated automatically for tuning iterations. The mechanisms employed for performance feedback to the auto-tuner are sketched in Section 5. Section 6 illustrates in an extensive case study how our approach was applied in the context of a real-world parallel application, and discusses quantitative and qualitative improvements. Section 7 compares our approach to related work. Section 8 offers a conclusion.

2 Automatic Performance Tuning

Search-based auto-tuners have been proposed in the literature to deal with the complexity compilers face in producing parallel code [2], [5], [15], [16], [17]. Compiler optimizations are often based on static code analysis and are part of a compiler’s internals. With the growing architectural variety of parallel systems, extending a compiler with optimization strategies for every platform is hardly feasible.

Technical Report 2009-2, Institute for Program Structures and Data Organization (IPD), University of Karlsruhe (TH), Germany, January 2009
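
To make the notion of a search-based auto-tuner concrete, the following minimal sketch enumerates every combination of a few tunable parameters and keeps the one with the lowest measured cost. It is an illustration under stated assumptions, not Atune-IL syntax and not the auto-tuner described in this paper: the parameter names, the value ranges, and the function `synthetic_cost` (a stand-in for timing a real program run) are all hypothetical, and exhaustive enumeration is only one of many possible search strategies.

```python
from itertools import product
from math import prod


def search_space_size(space):
    """Number of parameter combinations: the product of the value counts."""
    return prod(len(values) for values in space.values())


def exhaustive_tune(space, measure):
    """Try every combination in `space` and return (best_config, best_cost).

    `measure` maps a configuration dict to a cost, where lower is better.
    In a real tuner it would run the instrumented program and time it.
    """
    names = list(space)
    best = None
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if best is None or cost < best[1]:
            best = (config, cost)
    return best


# Hypothetical tunable parameters, loosely modeled on the kinds of
# parameters named in the abstract (thread count, algorithm, stages).
space = {
    "num_threads": [1, 2, 4, 8],
    "algorithm": ["A", "B"],
    "pipeline_stages": [2, 3, 4],
}


def synthetic_cost(config):
    """Assumed cost model standing in for a measured run time."""
    base = 100.0 if config["algorithm"] == "A" else 80.0
    return base / config["num_threads"] + 2.0 * config["pipeline_stages"]


best_config, best_cost = exhaustive_tune(space, synthetic_cost)
print(search_space_size(space), best_config, best_cost)
```

Because the number of combinations is the product of the value counts of all parameters, each added parameter multiplies the search space. This is why restricting which combinations are tried matters as much as the search strategy itself.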