Mining Opportunities for Code Improvement in a Just-In-Time Compiler

Adam Jocksch¹, Marcel Mitran², Joran Siu², Nikola Grcevski², and José Nelson Amaral¹

¹ Department of Computing Science, University of Alberta, Edmonton, Canada
{ajocksch,amaral}@cs.ualberta.ca
² IBM Toronto Software Laboratory, Toronto, Canada

Abstract. The productivity of a compiler development team depends on its ability not only to design effective solutions to known code generation problems, but also to uncover potential code improvement opportunities. This paper describes a data mining tool that can be used to identify such opportunities based on a combination of hardware-profiling data and compiler-generated counters. This data is combined into an Execution Flow Graph (EFG), and then FlowGSP, a new data mining algorithm, finds sequences of attributes associated with subpaths of the EFG. Many examples of important opportunities for code improvement in the IBM® Testarossa compiler are described to illustrate the usefulness of this data mining technique. This mining tool is especially useful for programs whose execution is not dominated by a small set of frequently executed loops. Information about the amount of space and time required to run the mining tool is also provided. In comparison with manual search through the data, the mining tool saved a significant amount of compiler development time and effort.

1 Introduction

Compiler developers continue to face the challenges of accelerated time-to-market and significantly reduced release cycles for both hardware and software. Microarchitectures continue to grow in number, complexity, and diversity. In this evolving technological environment, commercial compiler development teams must discover and rank the next set of opportunities for code transformations that will provide the highest ratio of performance improvement to development cost.
R. Gupta (Ed.): CC 2010, LNCS 6011, pp. 10–25, 2010. © Springer-Verlag Berlin Heidelberg 2010

The discovery of opportunities for profitable code transformations in large enterprise applications presents additional challenges. Traditionally, compiler developers have relied on the intuition that the code that is relevant for performance improvement is located in easily identifiable, frequently executed regions of the code, often called hot loops. However, many enterprise applications do not exhibit discernible regions of frequently executed code. Rather, these applications exhibit a flat profile: thousands of methods are invoked along an execution path, and no single method accounts for a significant portion of the