Seamless FPGA deployment over Spark in cloud computing: A use case on machine learning hardware acceleration

Christoforos Kachris, Ioannis Stamelos, Elias Koromilas, and Dimitrios Soudris

Institute of Computer and Communications Systems (ICCS), GR

Abstract. Emerging cloud applications like machine learning and data analytics need to process huge amounts of data. Typical processor architectures cannot process this volume of data efficiently without consuming an excessive amount of energy. Therefore, novel architectures have to be adopted in future data centers in order to cope with the increasing amount of data that needs to be processed. In this paper, we present a novel scheme for the seamless deployment of FPGAs in data centers under the Spark framework. The proposed scheme, developed in the VINEYARD project, allows the efficient utilization of FPGAs without the need to change the applications. The performance evaluation is based on the KMeans ML algorithm, which is widely used in clustering applications. The proposed scheme has been evaluated in a cluster of heterogeneous MPSoCs. The performance evaluation shows that FPGAs can speed up machine learning applications and significantly reduce energy consumption.

Keywords: hardware accelerators, data centre, heterogeneous, big data

1 Introduction

Machine learning, data analytics and Big Data are some of the emerging cloud applications responsible for the significant increases in data-center workloads in recent years. In 2015, the total network traffic of the data centres was around 4.7 Exabytes and it is estimated that by the end of 2018 it will cross the 8.5-Exabyte mark, following a compound annual growth rate (CAGR) of 23% [1]. In response to this scaling in network traffic, data-centre operators have resorted to utilizing more powerful servers.
Relying on Moore’s law for the extra edge, CPU technologies have scaled in recent years by packing an increasing number of transistors on chip, leading to higher performance ratings. However, on-chip clock frequencies were unable to follow this upward trend due to strict power-budget constraints. Thus, a few years ago a paradigm shift to multicore processors was adopted as an alternative solution for overcoming the problem. With multicore processors one could increase server performance without increasing the clock frequency. Unfortunately, this solution was soon found to scale poorly in the longer term as well. The performance gains achieved by