Optimizing Message-Passing on Multicore Architectures using Hardware Multi-Threading

Daniele Buono, Tiziano De Matteis, Gabriele Mencagli and Marco Vanneschi
Department of Computer Science, University of Pisa
Largo B. Pontecorvo, 3, I-56127 Pisa, Italy
Email: {d.buono, dematteis, mencagli, vannesch}@di.unipi.it

Abstract—Shared-memory and message-passing are two opposite models for developing parallel computations. The shared-memory model, adopted by existing frameworks such as OpenMP, represents a de-facto standard on multi-/many-core architectures. However, message-passing deserves to be studied for its inherent properties in terms of portability and flexibility, as well as for its easier debugging. Achieving good performance from the use of messages on shared-memory architectures requires an efficient implementation of the run-time support. This paper investigates the definition of a delegation mechanism on multi-threaded architectures able to: (i) overlap communications with calculation phases; (ii) parallelize distribution and collective operations. Our ideas are exemplified using two parallel benchmarks on the Intel Phi, showing that in these applications our message-passing support outperforms MPI and reaches performance similar to standard OpenMP implementations.

Keywords—message passing, shared memory, multi-core, communications-calculation overlapping, hardware multi-threading.

I. INTRODUCTION

Traditionally, message-passing and shared-memory constitute the two fundamental concurrency models underlying modern programming models for parallel computing. The debate about the most efficient and effective way to develop parallel programs is a complicated and controversial issue. In general, the best solution depends on the specific application pattern and on the use of communication/synchronization [2], [12].
The result is that existing frameworks adopt one of the two concurrency models (notably OpenMP for shared-memory and MPI for message-passing programming), while the choice of the most appropriate model for their own applications is left to programmers [22].

One of the most recognized advantages of the message-passing model is its inherent independence from the underlying architecture, i.e. it is portable in principle: provided that the proper run-time support is designed, a message-passing computation can be implemented on a distributed-memory cluster, on a shared-memory multiprocessor such as a multi-/many-core chip multiprocessor, or (perhaps the most interesting case for very high parallelism) on a cluster of multi-/many-core nodes.

On shared-memory architectures, a comparison between the performance achieved by message-passing and shared-memory implementations of the same benchmark suite is useful to understand the performance issues of the two models in a very pragmatic way. The general result is that masking the impact of communications on the overall performance, and providing an efficient way to accelerate distribution and collective functionalities, are crucial aspects in designing message-passing parallel programs able to scale acceptably to large parallelism degrees [14]. In this context, this paper investigates the design issues of message-passing supports on multi-/many-core architectures and provides the following research contributions:

- we describe a general run-time support mechanism based on communication threads coupled with the functionalities of parallel programs. Although the idea of using support threads is not new, we discuss very flexible ways to assign communication threads to different sub-parts of the computation which need to be accelerated.
The result is a uniform mechanism to overlap communications with calculation and to parallelize distribution activities and collective operations;

- we discuss the execution of communication threads on hardware contexts of multi-threaded architectures (notably the recent Intel Phi™). The importance of Hardware Multi-Threading (HMT for short) for parallel programs is debatable. Instead of using it to increase the maximum parallelism degree of the computation, we propose to use HMT to execute support activities such as point-to-point communications. This way of using HMT by cooperative threads is essentially new in the context of message-passing programming.

In Section V we conclude the paper with an experimental evaluation using two parallel benchmarks on the Intel Phi™. The results show that the use of communication threads, properly mapped onto HMT contexts, represents an effective way to mask communication latencies and to parallelize distribution activities. The general result is that the performance of our message-passing implementations is similar to that of corresponding shared-memory versions developed using standard tools (the difference is bounded by 10%), and outperforms classic message-passing frameworks such as MPI.

II. RELATED WORK

The analysis of message-passing libraries on multi-/many-core architectures and the performance comparison with shared-memory programming models is a debated issue. Studies such as [2], [12], [22] compare MPI and OpenMP, obtaining results in favor of one of the two paradigms depending on the application. In general, they all agree that the main overhead in MPI implementations comes from the communication support.