A Comparison Study of Data Cache Schemes Exploiting Reuse Information in Multiprocessor Systems

J. Sahuquillo, A. Pont, S. Petit, and V. Milutinovic

Dept. Informatica de Sistemas y Computadores
Universidad Politecnica de Valencia
Cno. De Vera s/n, 46071 Valencia (Spain)
{jsahuqui,apont,spetit}@disca.upv.es

Department of Computer Engineering
School of Electrical Engineering, University of Belgrade
POB 35-54, 11120 Belgrade, Serbia, Yugoslavia
VM@etf.bg.ac.yu

Abstract

Recent cache research has mainly focused on designs that split the first-level data cache. Although the underlying goal of all cache schemes is to exploit data locality, the means to this end vary widely between schemes. One important set of schemes appearing in the open literature makes use of reuse information. Performance results in the open literature are usually given for uniprocessor systems. Given the current scale of integration and advances in technology, a large amount of research has concentrated on symmetric multiprocessors (SMPs), specifically on SMPs integrated on a single chip. We therefore feel it is worthwhile to study the impact of such cache schemes on SMPs. A block management analysis has previously been performed in uniprocessor systems. This paper studies the impact on an SMP system of three schemes that have appeared in the literature and make use of reuse information. Results, including speedup, are obtained for one, two, and four processors running a subset of the SPLASH-2 benchmark suite. One scheme splits the cache according to a data locality criterion (NTS), while the others use counters to estimate data locality (the Filter and Filter-swap schemes). Results show that the Filter scheme presents the best block management and is at least equivalent to a conventional cache of 25KB (with one processor).
Furthermore, the effectiveness of the management improves as the number of processors increases, which seems to be the current trend.

Keywords: splitting data caches, data localities, performance evaluation, reuse information.

1. INTRODUCTION

One of the most important goals when designing high-performance computers is to minimize the average data memory access time. This time, measured in processor cycles, has been growing in recent years, further widening the memory-processor gap. However, cache organizations in modern microprocessors are basically the same as they were two decades ago. Recent research [1-12] has focused on optimizing the first-level (L1) data cache organization in order to increase the L1 hit ratio and reduce this critical time. The proposed models usually classify data lines into two independent sets according to a predefined characteristic shown by the data. To improve performance, both types of data are cached and treated separately, in caches with independent organizations. For this purpose, the L1 cache is usually split into two parallel caches (also called subcaches, because together they form the first level), each caching one type of predefined data line. The main advantage of having two independent subcaches is that it is possible to tune
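The split-cache idea described above can be illustrated with a minimal sketch: blocks are routed to one of two subcaches based on recorded reuse information, in the spirit of reuse-based schemes such as NTS. All class names, sizes, and the direct-mapped organization here are illustrative assumptions, not the exact organizations evaluated in the paper.

```python
class SplitL1:
    """Toy split L1 data cache: two direct-mapped subcaches plus a
    per-block reuse history that drives block placement on a miss."""

    def __init__(self, sets=4, block=16):
        self.block = block
        self.sets = sets
        self.temporal = [None] * sets      # blocks predicted to show reuse
        self.non_temporal = [None] * sets  # blocks predicted not to
        self.reused = {}                   # per-block reuse history ("reuse information")

    def access(self, addr):
        tag = addr // self.block
        idx = tag % self.sets
        # A hit in either subcache marks the block as reused.
        for cache in (self.temporal, self.non_temporal):
            if cache[idx] == tag:
                self.reused[tag] = True
                return "hit"
        # On a miss, place the block according to its reuse history:
        # blocks that showed reuse before go to the temporal subcache,
        # first-touch blocks go to the non-temporal one.
        if self.reused.get(tag, False):
            self.temporal[idx] = tag
        else:
            self.non_temporal[idx] = tag
        return "miss"

cache = SplitL1()
cache.access(0x100)  # cold miss: block placed in the non-temporal subcache
cache.access(0x100)  # hit: block is now recorded as reused
```

Because each subcache has its own organization, parameters such as size, associativity, or block size could be tuned independently per data class, which is the tuning opportunity the introduction refers to.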