Cache Misses Prediction by Means of Data Mining Methods

Pavel Kordík, Ivan Šimeček
kordikp, xsimecek@fel.cvut.cz
Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic

We obtained the "Cache misses" data from the parallel computing group, which is experimenting with the simulation of cache memories. The goal of this project is to analyze these data and to obtain a model of the number of cache misses as a function of input features such as the associativity of the memory. We also need to determine the significance of the input features.

Every modern CPU uses a complex memory hierarchy that consists of several levels of cache memory. It is difficult to predict the behavior of this hierarchy for a given program (for details see [1]). The Cache Analyzer (CA) simulates the behavior of a real microprocessor's cache and computes the number of cache misses during a computation. All measurements are done in "off-line" mode: the CA uses its own virtual cache memory for the exact simulation, so other CPU activity does not influence its behavior.

Using this analyzer we collected data measuring the number of cache misses for various cache parameters, matrix dimensions, etc. These data were essential for the analytical estimation of the number of cache misses (for details see [2]). The cache misses data were also analyzed by means of data mining methods. This analysis is the main topic of this paper and is discussed in more detail below.

First, the data had to be preprocessed. They were transformed into the native format of the data mining application WEKA [3], in which almost all experiments were performed. We tried to predict the number of cache misses from the input variables (read operations, size of matrices, etc.). Data mining methods from the categories of decision trees, Bayes classifiers and neural networks were used.
A detailed description of these methods can be found in [4]. Because WEKA is designed mainly to solve classification problems, we divided the output attribute "number of cache misses" into 10 intervals (classes). We achieved only 62% classification accuracy with Bayes-based methods (Bayes Net, Naive Bayes Simple, etc.). Other methods failed to produce any results because of their memory demands.

When we studied why the performance was so low, we found that the data had to be preprocessed further. A new data set was created from the original one by leaving out redundant measurements that carried little additional information. It consisted of 1500 records that representatively described the number of cache misses (almost uniformly distributed). With this new data set we achieved the following classification accuracies: MultiLayer Perceptron MLP (92%), Radial Basis Function network RBF (93%), decision tree C4.5 (95%), etc. These accuracies are high, so we can conclude that data mining methods can estimate the number of cache misses with a relatively low error.

Data mining also allows us to find out which input variables (features) are most important for estimating the number of cache misses (feature ranking). Again, we performed several experiments with the feature ranking methods available in WEKA. The results show that the most important feature is readCount (number of read operations), followed by nonzero (number of nonzero elements in the matrix) and width (the thickness of the strip).
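
The division of the output attribute into 10 classes mentioned above can be illustrated with a short sketch. The text does not state how the interval boundaries were chosen, so the equal-width binning used here is an assumption, and the function name equal_width_bins and the sample values are hypothetical rather than taken from the original experiments or from WEKA.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int = 10) -> List[int]:
    """Map each value to a class index in 0..n_bins-1 using equal-width intervals.

    Assumption: equal-width intervals between the observed minimum and
    maximum; the paper does not say which binning scheme was used.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first interval.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would land at index n_bins, so clamp it into the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical cache miss counts, not taken from the paper's data set.
miss_counts = [0, 120, 450, 999, 1000]
classes = equal_width_bins(miss_counts)
print(classes)
```

Each record then carries a class label instead of a raw miss count, which is the form a classifier expects. Equal-frequency intervals would be an alternative if the miss counts were strongly skewed.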