Mining Opportunities for Code Improvement in a Just-In-Time Compiler

Adam Jocksch¹, Marcel Mitran², Joran Siu², Nikola Grcevski², and José Nelson Amaral¹

¹ Department of Computing Science, University of Alberta, Edmonton, Canada
{ajocksch,amaral}@cs.ualberta.ca
² IBM Toronto Software Laboratory, Toronto, Canada

Abstract. The productivity of a compiler development team depends on its ability not only to design effective solutions to known code generation problems, but also to uncover potential code improvement opportunities. This paper describes a data mining tool that can be used to identify such opportunities based on a combination of hardware-profiling data and compiler-generated counters. These data are combined into an Execution Flow Graph (EFG), and then FlowGSP, a new data mining algorithm, finds sequences of attributes associated with subpaths of the EFG. Many examples of important opportunities for code improvement in the IBM® Testarossa compiler are described to illustrate the usefulness of this data mining technique. The mining tool is especially useful for programs whose execution is not dominated by a small set of frequently executed loops. Information about the amount of space and time required to run the mining tool is also provided. In comparison with a manual search through the data, the mining tool saved a significant amount of compiler development time and effort.

1 Introduction

Compiler developers continue to face the challenges of accelerated time-to-market and significantly reduced release cycles for both hardware and software. Micro-architectures continue to grow in number, complexity, and diversity. In this evolving technological environment, commercial compiler development teams must discover and rank the next set of opportunities for code transformations that will provide the highest ratio of performance improvement to development cost.
The discovery of opportunities for profitable code transformations in large enterprise applications presents additional challenges. Traditionally, compiler developers have relied on the intuition that the code relevant for performance improvement is located in easily identifiable, frequently executed regions of the code, often called hot loops. However, many enterprise applications do not exhibit discernible regions of frequently executed code. Rather, these applications exhibit a flat profile: thousands of methods are invoked along an execution path, and no single method accounts for a significant portion of the

R. Gupta (Ed.): CC 2010, LNCS 6011, pp. 10–25, 2010. © Springer-Verlag Berlin Heidelberg 2010