Bundle Methods for Structured Output Learning — Back to the Roots

Michal Uřičář, Vojtěch Franc, and Václav Hlaváč

Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Technická 2, 166 27 Prague 6, Czech Republic
{uricamic,xfrancv,hlavac}@cmp.felk.cvut.cz

Abstract. Discriminative methods for learning structured output classifiers have been gaining popularity in recent years due to their successful applications in fields like computer vision, natural language processing, etc. Learning of structured output classifiers leads to solving a convex minimization problem which is still hard to solve by standard algorithms in real-life settings. A significant effort has been put into the development of specialized solvers, among which the Bundle Method for Risk Minimization (BMRM) [1] is one of the most successful. The BMRM is a simplified variant of bundle methods well known in the field of non-smooth optimization. In this paper, we propose two speed-up improvements of the BMRM: i) using the adaptive prox-term known from the original bundle methods, ii) starting optimization from a non-trivial initial solution. We combine both improvements with the multiple cutting plane model approximation [2]. Experiments on real-life data show consistently faster convergence, achieving a speed-up of up to a factor of 9.7.

Keywords: Structured Output Learning, Bundle Methods, Risk Minimization, Structured Output SVM.

1 Introduction

Learning predictors from data is a standard machine learning task. A large number of such tasks are translated into a convex quadratically regularized risk minimization problem

    w^* = \arg\min_{w \in \mathbb{R}^n} F(w) := \frac{\lambda}{2} \|w\|^2 + R(w) .   (1)

The objective F : \mathbb{R}^n \to \mathbb{R}, referred to as the regularized risk, is the sum of a quadratic regularization term and a convex empirical risk R : \mathbb{R}^n \to \mathbb{R}. The scalar \lambda > 0 is a predefined constant and w \in \mathbb{R}^n is the parameter vector to be learned.
The quadratic regularization term serves as a means to constrain the space of solutions in order to improve generalization. The empirical risk evaluates the match between the parameters w and the training examples. The risk typically splits into a sum of convex functions r_i : \mathbb{R}^n \to \mathbb{R}, i.e. the risk reads

    R(w) = \sum_{i=1}^{m} r_i(w) .   (2)

J.-K. Kämäräinen and M. Koskela (Eds.): SCIA 2013, LNCS 7944, pp. 162–171, 2013.
© Springer-Verlag Berlin Heidelberg 2013
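To make Eqs. (1)–(2) concrete, the following is a minimal sketch of evaluating the regularized risk F(w). It assumes, purely for illustration, that each per-example convex term r_i is a binary hinge loss r_i(w) = max(0, 1 − y_i⟨w, x_i⟩); the structured-output setting of the paper replaces this with a structured surrogate loss, but the shape of F is the same.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """Evaluate F(w) = (lam/2)||w||^2 + sum_i r_i(w) for hinge-loss r_i.

    Illustrative instance of Eqs. (1)-(2); the structured-output case
    would replace the hinge terms with structured surrogate losses.
    """
    margins = y * (X @ w)                      # y_i <w, x_i> for each example
    R = np.maximum(0.0, 1.0 - margins).sum()   # empirical risk R(w), Eq. (2)
    return 0.5 * lam * np.dot(w, w) + R        # regularized risk F(w), Eq. (1)

# Toy data: two examples in R^2.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(regularized_risk(np.zeros(2), X, y, lam=1.0))  # -> 2.0 (hinge of 1 per example)
```

Note that F is convex but non-smooth (the hinge has a kink at the margin), which is exactly why bundle-type methods such as BMRM, built on cutting-plane models of R, are a natural fit for minimizing it.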