Exploring the Capacity of a Modern SMT Architecture to Deliver High Scientific Application Performance

Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis and Nectarios Koziris
National Technical University of Athens
School of Electrical and Computer Engineering
{valia,anastop,kkourt,nkoziris}@cslab.ece.ntua.gr

Abstract. Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that heterogeneity among simultaneously executed applications can yield significant performance gains under SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. In this paper, we explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. We evaluate and contrast speculative precomputation (SPR) and thread-level parallelism (TLP) techniques for a series of scientific codes executed on an SMT processor. We also examine the effect of thread synchronization mechanisms on multithreaded parallel applications that are executed on a single SMT processor. To support this evaluation, we also present results gathered from the performance monitoring hardware of the processor.

1 Introduction

Despite the efficiency of code optimization techniques and the continued advances in caches, memory latency still dominates the performance of many applications on modern processors. This CPU-memory gap seems difficult to alleviate: on the one hand, CPU clock speeds continue to advance more rapidly than memory access times; on the other hand, data working sets keep growing, and the complexity of conventional applications sets a limit on ILP.
This research is supported by the Pythagoras II Project (EPEAEK II), co-funded by the European Social Fund (75%) and National Resources (25%).

One approach to maintaining high processor throughput despite the large relative memory latency has been simultaneous multithreading (SMT). SMT is a hardware technique that allows a processor to issue and execute instructions from multiple independent threads in the same cycle. The dynamic sharing of