A Dynamic Schema to Increase Performance in Many-Core Architectures through Percolation Operations

Elkin Garcia*, Daniel Orozco*, Rishi Khan†, Ioannis E. Venetis‡, Kelly Livingston* and Guang R. Gao*

*Computer Architecture and Parallel System Laboratory (CAPSL), Electrical and Computer Engineering Department, University of Delaware, Newark, DE 19716, U.S.A. Email: {egarcia, orozco, kelly, ggao}@capsl.udel.edu
†ET International, Newark, DE 19711, U.S.A. Email: rishi@etinternational.com
‡Department of Computer Engineering and Informatics, University of Patras, Rion 26500, Greece. Email: venetis@ceid.upatras.gr

Abstract—Optimization of parallel applications under new many-core architectures is challenging, even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures return only incremental gains in performance, and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantage of static techniques comes from the new scenarios present in many-cores: plenty of thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important to the performance of applications. In particular, we found that the scheduling of data movement operations significantly impacts performance. To overcome those difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations that schedule data movement properly. In addition, we have fused percolation operations with dynamic scheduling into a dynamic percolation approach. We use Dense Matrix Multiplication on a modern many-core to illustrate how our proposed techniques are able to increase performance under these new environments.
In our study on the IBM Cyclops-64, we raised the performance from 44 GFLOPS (out of 80 GFLOPS possible) to 70.0 GFLOPS (operands in on-chip memory) and 65.6 GFLOPS (operands in off-chip memory). The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory, respectively.

I. INTRODUCTION

This paper presents a comprehensive case study that shows how to obtain high performance on modern many-core processors. This study is important because it addresses a situation arising on many-core architectures that was not previously encountered in multi-core architectures or in other systems such as clusters or shared-memory processors. Many-cores provide an environment where hardware resources are uncomplicated and abundant: large numbers of thread units are present, on-chip memory can be user-managed, an automatic data cache may not be present, and hardware support for synchronization is available. In summary, the environment is different, and it requires a new optimization paradigm.

Early results of this research were published as a short paper in Computing Frontiers 2012 under the title "Dynamic Percolation: A Case of Study on the Shortcomings of Traditional Optimization in Many-core Architectures". This paper extends the content of our previous publication.

We have observed that the use of traditional optimization techniques does not result in the best performance on many-core architectures. As an example, we take the simple case of dense matrix multiplication (DMM) running on a modern many-core architecture such as the IBM Cyclops-64 processor (C64) [8]; the extensive efforts toward optimization of this important kernel only resulted in a disappointing performance of 44.12 GFLOPS (out of 80 GFLOPS possible) [19], [17]. This far-from-optimal performance was not the result of a lack of trying.
The study presented by Garcia explored a broad range of optimization strategies: multiple levels of tiling were employed; instruction scheduling, register allocation and instruction selection were done by hand; the code was written in assembly; pipelining was used; synchronization was optimized through the use of hand-written assembly primitives; and so on. The study presented by Garcia ultimately shows that peak performance could not be achieved by static techniques alone, even for simple, highly parallel and regular programs such as matrix multiply.

Surprised by Garcia's early results on Matrix Multiply, we analyzed their experiments to find out why their methodical approach failed to achieve peak performance. Through extensive profiling, we have seen that static plans are bound to fail to achieve peak performance on many-core architectures. Mainly, this happens because it is not possible to statically create a plan that efficiently schedules data movement and computation at the right times. The reason is that small variations in the execution of tasks (or