Stochastic performance tuning of complex simulation applications using unsupervised machine learning

Oksana Shadura
CERN, Geneva, Switzerland
Email: oksana.shadura@cern.ch

Federico Carminati
CERN, Geneva, Switzerland
Email: federico.carminati@cern.ch

Abstract—Machine learning for complex multi-objective problems (MOP) can substantially speed up the discovery of solutions belonging to Pareto landscapes and improve Pareto front accuracy. Studying the convergence speedup of multi-objective search on well-known benchmarks is an important step in the development of algorithms to optimize complex problems such as High Energy Physics particle transport simulations. In this paper we describe how we perform this optimization via tuning based on genetic algorithms and machine learning for MOP. One of the approaches described introduces a specific multivariate analysis operator that can be used when fitness function evaluations are expensive, in order to speed up the convergence of the "black-box" optimization problem.

I. INTRODUCTION

Modern fundamental science requires the development of complex experimental machines such as the LHC in High Energy Physics or land- and space-based x-ray and gamma-ray telescopes in High Energy Astrophysics. Other examples can be found in the fields of protein synthesis, gene regulation and research on genome evolution. All these activities generate large data sets and require the development of new approaches and methods for their efficient analysis on modern computer platforms. This work concerns the analysis and optimization of the performance of the GeantV code [1], the prototype of the next-generation particle transport simulation software intended to succeed Geant4 [2], the current gold standard in high energy physics (HEP) and beyond.
Geant4 is a toolkit for simulating the passage of particles through different kinds of matter, with applications including high energy and nuclear physics, accelerator physics, medicine and space science. It is widely used in HEP experiments at the Large Hadron Collider (LHC) located at CERN (Geneva, Switzerland). One of the objectives of the GeantV project is to achieve good performance on a wide range of modern computing architectures with good scalability for complex computations. This is important since Geant4 is the single program consuming the largest share (50%) of the CPU cycles used for HEP. Geant4 was developed in the 1990s and is not well suited to take advantage of the latest CPU and accelerator architectures. The GeantV project started in 2013, following an R&D phase focused on optimal exploitation of instruction-level parallelism for particle transport simulation both on CPUs and on accelerators such as GPUs and Intel Xeon Phi. Emphasis has been put on the optimization of cache usage by careful management of data locality [3]. GeantV benefits significantly from a specially developed computational solid geometry (CSG) modeler, which provides a set of optimized shape primitives and a highly parallel geometry navigator. This provides GeantV with the necessary ray-tracing functionality for the efficient propagation of particles through the target geometry [4]. The GeantV project is recasting the simulation algorithms to get maximum benefit from SIMD/MIMD architectures on massively parallel systems [5]. This involves finding the optimal balance of several factors influencing computational performance (floating-point performance, off-chip memory bandwidth, usage of cache and memory hierarchy, etc.). As a consequence, a large number of parameters have to be optimized. This optimization task can be treated as a black-box problem, which requires searching for the optimum set of parameters using only point-wise function evaluations.
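To make the black-box setting concrete, the following minimal Python sketch searches a parameter space using only point-wise evaluations and keeps the non-dominated (Pareto-optimal) results. It is an illustration under stated assumptions, not the method used for GeantV: `toy_black_box` is a hypothetical stand-in for a costly simulation run, the two objectives and the parameter bounds are invented for the example, and plain random sampling replaces the genetic algorithm discussed in this paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

Params = Tuple[float, ...]
Objectives = Tuple[float, ...]
Evaluated = List[Tuple[Params, Objectives]]


def toy_black_box(params: Params) -> Objectives:
    """Hypothetical stand-in for one costly simulation run.

    Returns two objectives to be minimized. They conflict by construction:
    f1 is smallest at (0, 0) and f2 is smallest at (1, 1).
    """
    x, y = params
    f1 = x ** 2 + y ** 2
    f2 = (x - 1.0) ** 2 + (y - 1.0) ** 2
    return (f1, f2)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(ai <= bi for ai, bi in zip(a, b))
    strictly_better = any(ai < bi for ai, bi in zip(a, b))
    return no_worse and strictly_better


def non_dominated(evaluated: Evaluated) -> Evaluated:
    """Keep only the points whose objectives are not dominated by any other point."""
    return [
        (p, f)
        for p, f in evaluated
        if not any(dominates(g, f) for _, g in evaluated)
    ]


def random_search(
    black_box: Callable[[Params], Objectives],
    bounds: Sequence[Tuple[float, float]],
    n_evals: int,
    seed: int = 0,
) -> Evaluated:
    """Sample parameters uniformly; the search sees only point-wise evaluations."""
    rng = random.Random(seed)
    evaluated: Evaluated = []
    for _ in range(n_evals):
        params = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        evaluated.append((params, black_box(params)))
    return non_dominated(evaluated)


# Assumed bounds and evaluation budget, chosen only for illustration.
front = random_search(toy_black_box, bounds=[(-1.0, 2.0), (-1.0, 2.0)], n_evals=200)
```

The search never inspects the internals of `toy_black_box`; it only calls it at chosen points, which is what makes the approach applicable when each evaluation is an expensive simulation run. The returned `front` is an approximation of the Pareto front whose quality depends on the evaluation budget.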
In our optimization work, we consider particle transport simulation to be a complex heuristic parametric model with costly evaluations and an unpredictable fitness landscape, which we intend to optimize using stochastic search algorithms. The objective of this work is to observe whether, by using unsupervised machine learning, we can accelerate the process of finding a Pareto front describing dominance relations between fitness functions. The results described in this article are part of the research on the "black-box" optimization of GeantV as a multi-objective problem for performance measurements. By combining a genetic algorithm with a machine learning approach, we try to discover special behaviors and fixed points of evolutionary systems in order to accelerate the convergence rate of the algorithm for "black-box" optimization. Before optimizing GeantV simulations, we will try to prototype the algorithm's performance