MPT: Multiple Parallel Tempering for High-Throughput MCMC Samplers

Morteza Hosseini
Computer Science and Electrical Engineering Department
University of Maryland, Baltimore County
Maryland, USA
hs10@umbc.edu

Rashidul Islam
Computer Science and Electrical Engineering Department
University of Maryland, Baltimore County
Maryland, USA
dr97531@umbc.edu

Lahir Marni
Computer Science and Electrical Engineering Department
University of Maryland, Baltimore County
Maryland, USA
mlahir1@umbc.edu

Tinoosh Mohsenin
Computer Science and Electrical Engineering Department
University of Maryland, Baltimore County
Maryland, USA
tinoosh@umbc.edu

Abstract: This paper proposes "Multiple Parallel Tempering" (MPT), a class of Markov Chain Monte Carlo (MCMC) algorithms for high-throughput hardware implementations. MCMC algorithms generate samples from target probability densities and are commonly employed in stochastic processing techniques such as Bayesian inference and maximum likelihood estimation, in which processing large amounts of data in real time with high-throughput samplers is critical. For high-dimensional and multi-modal probability densities, Parallel Tempering (PT) MCMC has proven to have superior mixing and faster convergence to the target distribution compared with other popular MCMC algorithms such as Metropolis-Hastings (MH). The MPT algorithm proposed in this paper adds a new integer parameter, D, to the original PT algorithm. This modification turns one MCMC sampler into multiple independent kernels that alternately generate their sets of samples, one after another. Our experimental results on Gaussian mixture models show that for large values of D, the auto-correlation function of the proposed MPT falls off comparably to that of a PT sampler. Fully configurable, pipelined hardware accelerators for both the proposed MPT and PT are designed in Verilog HDL and implemented on an FPGA.
Both algorithms are also written in C and evaluated on the multi-core CPU of the TX2 SoC. Our implementation results indicate that, by selecting an appropriate value for D in our case study, the sampling throughput can rise from 4.5 Msps for PT to 135 Msps on average for MPT, a rate close to the maximum achievable frequency of the target FPGA and about 1470× higher than that of an implementation that fully exploits the multi-core CPU.

Index Terms: MCMC, Mixture model, Parallel Tempering, High-Throughput Sampler, Hardware Accelerator

I. INTRODUCTION

MCMC is one of the essential tools for stochastic processing techniques and is mainly used for solving Bayesian inference problems found in several fields such as modern machine learning [1], probabilistic mixture models [2], genetics [3] and prediction of time series events [4][5]. Stochastic estimators such as MCMC are increasingly becoming the methods of choice because deterministic numerical techniques are inefficient for high-dimensional data. In probabilistic inference problems, we often need to compute the expectation of a function g(x) of a random variable x by solving the integral:

$$E[g(x)] = \int g(x)\, p(x)\, dx \tag{1}$$

where p(x) is the target probability distribution. MCMC combines two components: Monte Carlo, which indicates that the process is based on random events, and Markov chain, which implies that the sequence of events is causal, meaning that every current event is a result of the previous event (or events). In Monte Carlo integration, a set of samples $x^{(i)}$, where i = 1, ..., N, is drawn from a target distribution to approximate an expected value as calculated below:

$$\hat{E}[g(x)] = \frac{1}{N} \sum_{i=1}^{N} g(x^{(i)}) \tag{2}$$

Generally, the accuracy of the MCMC sampler can be improved by increasing the number of samples, N. However, there is no unique way to measure the convergence of MCMC samples [6].
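As a rough illustration of equation 2, the sketch below averages g over N independent draws. It is not part of the paper's method: the function name `monte_carlo_expectation`, the choice g(x) = x^2, and the standard normal target are assumptions made purely for illustration, and the samples are drawn directly with Python's `random.gauss` rather than by an MCMC sampler.

```python
import random


def monte_carlo_expectation(g, draw_sample, n_samples):
    """Approximate E[g(x)] by averaging g over n_samples draws (equation 2)."""
    total = 0.0
    for _ in range(n_samples):
        total += g(draw_sample())
    return total / n_samples


# Hypothetical example: g(x) = x^2 under a standard normal target,
# for which the exact expectation is 1.
rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = monte_carlo_expectation(
    g=lambda x: x * x,
    draw_sample=lambda: rng.gauss(0.0, 1.0),
    n_samples=100_000,
)
print(estimate)  # should land close to 1
```

With independent draws, the error of this estimate shrinks roughly in proportion to 1/sqrt(N), which is why a larger N generally improves accuracy.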
Common techniques include checking the auto-correlation of the samples from the underlying sampler, or using several independent sampling kernels initialized to different states, which should eventually generate similar results once converged [6][2].

A Markov chain is an iterative stochastic process that is initialized to a starting state, $x^{(1)}$, and determines each next state based on a transition probability. For a first-order Markov chain, the transition probability depends only on the previous state, as shown in equation 3:

$$p(x^{(i)} \mid x^{(i-1)}, x^{(i-2)}, \ldots, x^{(1)}) = p(x^{(i)} \mid x^{(i-1)}) \tag{3}$$

MCMC consists of iterative, interdependent kernels, which makes hardware acceleration challenging. To overcome this challenge, FPGA-based research has focused on parallel implementation of Markov chains to achieve acceleration [7][8][9][10][11][12][13].
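To make the first-order dependence of equation 3 and the auto-correlation check concrete, the following is a minimal sketch of a random-walk Metropolis-Hastings chain. It is not the PT or MPT sampler of this paper. The two-component Gaussian mixture target, the step size, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def target_density(x):
    """Unnormalized two-component Gaussian mixture (hypothetical example target)."""
    return 0.5 * math.exp(-0.5 * (x + 2.0) ** 2) + 0.5 * math.exp(-0.5 * (x - 2.0) ** 2)


def metropolis_hastings(n_samples, step, rng, x0=0.0):
    """Random-walk MH: each new state depends only on the previous one (equation 3)."""
    x = x0
    chain = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # The proposal is symmetric, so the acceptance ratio is p(proposal) / p(x).
        if rng.random() < target_density(proposal) / target_density(x):
            x = proposal
        chain.append(x)
    return chain


def autocorrelation(chain, lag):
    """Normalized sample auto-correlation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


rng = random.Random(1)  # fixed seed so the run is repeatable
chain = metropolis_hastings(20_000, step=1.0, rng=rng)
for lag in (0, 1, 10, 100):
    print(lag, round(autocorrelation(chain, lag), 3))
```

Because every iteration needs the state produced by the previous one, the loop in `metropolis_hastings` cannot be unrolled across iterations of a single chain, which is the interdependence that makes hardware acceleration difficult. An auto-correlation that decays quickly with lag suggests that the chain is mixing well.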