A Survey on Evaluating and Optimizing Performance of Intel Xeon Phi

Sparsh Mittal

Abstract
Intel's Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that compare Phi with CPUs and GPUs or execute workloads collaboratively across them. This paper will be useful for researchers and developers in the areas of computer architecture and high-performance computing.

Index Terms
Review, many-core processor, many-integrated core (MIC), vectorization, prefetching, compiler optimization.

1 INTRODUCTION
As the power budget and clock frequency of modern processors reach a plateau, parallelization has become the way to continue scaling performance. This has motivated researchers to design multi-core and even many-core processing units. Specifically, Intel's Xeon Phi [1] brings together the parallel processing power of a many-core computing unit and the programming ease of traditional CPUs [2]. In fact, in the June 2018 list of Top500 supercomputers, 19 supercomputers used Phi as the main processing unit [3] and seven supercomputers used Phi as a co-processor. Also, researchers from a wide range of backgrounds have deployed Phi for accelerating compute-intensive tasks and have compared it with other processing units, such as multi-core CPUs and GPUs.

Recently, the KNC and KNL Phis have been discontinued. As of October 2019, the KNM Phi is still shipping, but Intel has no roadmap for Phi. Further, all the salient features/optimizations of Xeon Phi have been incorporated into recent CPUs. As such, it is the right time to look back on the performance 'score-card' of Phi and to reflect on its salient features and limitations.
These insights and lessons will be useful for designers of next-generation computing systems.

Contributions: In this paper, we present a survey of works that evaluate and optimize the efficiency of Phi for a broad range of applications. Figure 1 shows the outline of the paper. We first present a background on Phi design and management policies (Section 2) and then discuss the strengths and limitations of Phi (Section 3). We then review the works that study Phi architecture and optimization techniques (Section 4). Next, we review the techniques for achieving data-, thread- and node-level parallelization (Section 5). Further, we discuss the works in terms of their application domain (Section 6). We then discuss works that perform comparative evaluation and collaborative execution of Phi with CPUs and/or GPUs (Section 7). Finally, we conclude the paper with a mention of the future outlook (Section 8). The insights provided by this survey will be valuable for further optimizing the performance of Phi and even other processing units.

Dr. Sparsh is with IIT Hyderabad, Kandi, Sangareddy 502285, Telangana, India. E-mail: sparsh@iith.ac.in. Support for this work was provided by Science and Engineering Research Board (SERB), India, award number ECR/2017/000622 and by Semiconductor Research Corporation. This paper has been accepted in Concurrency and Computation: Practice and Experience, 2020.