Iterative Compilation with Kernel Exploration

D. Barthou (2), S. Donadio (1,2), A. Duchateau (2), W. Jalby (2), and E. Courtois (3)

(1) Bull SA Company, France
(2) Université de Versailles, France
(3) CAPS Entreprise, France

Abstract. The increasing complexity of hardware mechanisms in recent processors makes high performance code generation very challenging. One of the main issues for high performance is the optimization of memory accesses. General purpose compilers, with no knowledge of the application context and only an approximate memory model, seem inappropriate for this task. Combining application-dependent optimizations on the source code with exploration of optimization parameters, as is done in ATLAS, has been shown to be one way to improve performance. Yet hand-tuned codes such as those in the MKL library still outperform ATLAS by a significant margin, and further effort is needed to bridge the gap between the performance obtained by automatic and manual optimizations. In this paper, a new iterative compilation approach for the generation of high performance codes is proposed. Unlike ATLAS, this approach is not application-dependent. The idea is to separate the memory optimization phase from the computation optimization phase. The first step automatically finds all possible decompositions of the code into kernels. With datasets that fit into the cache and simplified memory accesses, these kernels are simpler to optimize, either with the compiler, at source level, or with a dedicated code generator. The best decomposition is then found by a model-guided approach, performing the required memory optimizations on the source code. Exploration of optimization sequences and their parameters is achieved with a meta-compilation language, the X language. The first results on linear algebra codes for Itanium show that the performance obtained reduces the gap with that of highly optimized hand-tuned codes.
1 Introduction

The increasing complexity of hardware mechanisms incorporated in modern processors makes high performance code generation very challenging. One of the key difficulties in the code optimization process is that several issues have to be addressed simultaneously: for example, maximizing instruction level parallelism (ILP) and optimizing data reuse across multilevel memory hierarchies. Moreover, a code transformation will very often be beneficial to one aspect while being detrimental to the other. The problem is made worse by the fact that these issues are tackled at different levels of the compiler chain: most of the ILP is optimized by the backend, while data locality optimization is performed at a higher level.

A good example highlighting all of these problems is the simple matrix multiply operation. Although the code is fairly simple, none of the recent compilers is really able to generate code whose performance is close to that of hand-coded routines. To deal with this problem, Dongarra et al. [18] have developed a specialized code generator (ATLAS) combining iterative techniques and experimentation. ATLAS is a significant step in the right direction (it outperforms most compilers), but very often it still lags behind hand-coded routines. Recently, ATLAS has been improved by replacing the iterative search with an adapted cost model able to generate code with nearly the same performance [21]. But even with these recent improvements, vendor [8,16] or hand-tuned BLAS3 [11] still outperform ATLAS compiled codes and,