Triple Jump Acceleration for the EM Algorithm

Han-Shen Huang    Bou-Ho Yang    Chun-Nan Hsu
Institute of Information Science, Academia Sinica, Nankang, Taipei, Taiwan
{hanshen,ericyang,chunnan}@iis.sinica.edu.tw

Abstract

This paper presents the triple jump framework for accelerating the EM algorithm and other bound optimization methods. The idea is to extrapolate a third search point based on the previous two search points found by regular EM. As the convergence rate of regular EM becomes slower, the distance of the triple jump becomes longer, and thus provides higher speedup for data sets where EM converges slowly. Experimental results show that the triple jump framework significantly outperforms EM and other EM acceleration methods for a variety of probabilistic models, especially when the data set is sparse or less structured; such data sets usually slow down EM but are common in the real world. The results also show that the triple jump framework is particularly effective for Cluster Models.

1. Introduction

The Expectation-Maximization (EM) algorithm [6] is one of the most popular algorithms for learning probabilistic models from incomplete data. However, when applied to large real-world data sets with a large number of parameters to estimate, the EM algorithm is slow to converge. If the data sets also contain a large proportion of missing data, or if the model to be learned requires a large number of hidden variables, convergence can be even slower. Our goal is to develop an approach to accelerating the EM algorithm that requires minimal prior knowledge about the data and no human intervention for tuning. The result is the triple jump acceleration framework. The idea is to extrapolate a third search point based on the previous two search points found by regular EM. The extrapolation can reach a point far away from the previous two points, like the hop, step, and jump of the triple jump.
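As a concrete illustration of the extrapolation idea (a minimal Aitken-style sketch, not the paper's exact update rule), the "hop, step, jump" pattern might look like the following, where `em_step` is assumed to be any function performing one regular EM update:

```python
import numpy as np

def triple_jump(theta0, em_step):
    """Aitken-style extrapolation sketch: two regular EM updates
    (the "hop" and "step"), then one long extrapolation (the "jump").
    `em_step` is assumed to map a parameter vector to its EM update."""
    theta1 = em_step(theta0)            # hop
    theta2 = em_step(theta1)            # step
    d1 = theta1 - theta0
    d2 = theta2 - theta1
    # Estimate EM's convergence rate from successive differences.
    rate = np.linalg.norm(d2) / max(np.linalg.norm(d1), 1e-12)
    rate = min(rate, 1.0 - 1e-6)        # keep the jump factor finite
    eta = 1.0 / (1.0 - rate)            # slower EM (rate near 1) => longer jump
    return theta1 + eta * d2            # jump
```

For a linear contraction such as `em_step = lambda t: 0.9 * t + 1.0` (fixed point 10.0), a single jump from 0.0 lands exactly on the fixed point, which illustrates why the method pays off most when EM's rate is close to 1.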
An important feature of the triple jump framework is that, as the convergence rate of regular EM becomes slower, the distance of the jump becomes longer, and thus provides higher speedup for data sets where EM converges slowly. In addition to accelerating EM, the triple jump framework can also be applied to other bound optimization methods [10], including iterative scaling [2], non-negative matrix factorization [7], and the concave-convex computational procedure [13]. Experimental results show significant speedup for several different probabilistic models, including Bayesian Networks, Hidden Markov Models, and Mixtures of Gaussians. Experimental results also show that the framework is particularly effective for AUTOCLASS-like Cluster Models [4], for which the accelerated EM can always find the local optimum with one “triple jump,” regardless of the sparsity of the data.

The triple jump framework is based on the Aitken acceleration method [3]. Previously, Bauer, Koller and Singer [1] proposed an Aitken-based method to accelerate EM for Bayesian Networks called parameterized EM. Ortiz and Kaelbling [8] proposed a similar method for Mixtures of Gaussians. Though they showed that their methods can speed up EM in their experiments, the convergence property of EM is no longer guaranteed. Salakhutdinov and Roweis [10] showed that with the learning rate (i.e., the extent of the extrapolation) within a certain interval, parameterized EM is guaranteed to converge. However, because the learning rates in such an interval are too small, the speedup is not significant. Therefore, they proposed another method called adaptive overrelaxed EM [10], which switches back to regular EM during the search if the new data likelihood does not increase. In this way, the data likelihood increases monotonically and adaptive overrelaxed EM is guaranteed to converge. They also generalized their methods to other bound optimization methods [10].
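The fallback rule of adaptive overrelaxed EM can be sketched as follows. This is an illustrative reading of the scheme described above, with hypothetical function and parameter names: overshoot the EM update by a factor `eta`, but revert to the plain EM step whenever the log-likelihood would decrease, so monotone ascent (and hence convergence) is preserved.

```python
import numpy as np

def adaptive_overrelaxed_em(theta, em_step, log_lik, eta=1.0,
                            grow=1.1, n_iters=100, tol=1e-8):
    """Sketch of adaptive overrelaxed EM.  `em_step` performs one regular
    EM update M(theta); `log_lik` evaluates the data log-likelihood.
    The growth schedule for eta is an assumption, not from the paper."""
    ll = log_lik(theta)
    for _ in range(n_iters):
        m = em_step(theta)                 # regular EM update M(theta)
        cand = theta + eta * (m - theta)   # overrelaxed (extrapolated) step
        if log_lik(cand) > ll:
            theta, eta = cand, eta * grow  # accept and enlarge the rate
        else:
            theta, eta = m, 1.0            # fall back to regular EM
        new_ll = log_lik(theta)
        if new_ll - ll < tol:              # likelihood has plateaued
            break
        ll = new_ll
    return theta
```

Because every iteration ends at either an accepted improvement or a regular EM step, the likelihood sequence is non-decreasing, which is exactly the property that restores the convergence guarantee.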
The triple jump framework accelerates adaptive overrelaxed methods further and is guaranteed to converge as well. A critical difference between the triple jump framework and those previous works is that we use different learning rates for independent sub-vectors of the parameter vector. In contrast, previous works use one learning rate for all parameters. The derivation in Salakhutdinov and Roweis [10] shows that for the one-learning-rate case, the optimal learn-
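The per-sub-vector idea can be sketched as follows: estimate a separate Aitken-style rate, and hence a separate jump length, for each independent block of parameters. The block representation and names here are illustrative, not the paper's implementation.

```python
import numpy as np

def componentwise_jump(theta0, theta1, theta2, blocks):
    """Extrapolate with a separate learning rate per independent
    parameter sub-vector.  `blocks` lists (lo, hi) index ranges of the
    sub-vectors; theta0/theta1/theta2 are three consecutive EM iterates."""
    theta = theta2.copy()
    for lo, hi in blocks:
        d1 = theta1[lo:hi] - theta0[lo:hi]
        d2 = theta2[lo:hi] - theta1[lo:hi]
        # Each block gets its own rate estimate ...
        rate = np.linalg.norm(d2) / max(np.linalg.norm(d1), 1e-12)
        rate = min(rate, 1.0 - 1e-6)
        eta = 1.0 / (1.0 - rate)
        # ... so slowly converging blocks jump farther than fast ones.
        theta[lo:hi] = theta1[lo:hi] + eta * d2
    return theta
```

With one shared rate, the block that converges fastest would be over-extrapolated or the slowest block under-extrapolated; per-block rates let each sub-vector jump to its own projected fixed point.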