A CRITICAL EVALUATION OF STOCHASTIC ALGORITHMS FOR CONVEX OPTIMIZATION

Simon Wiesler 1, Alexander Richard 1, Ralf Schlüter 1, Hermann Ney 1,2

1 Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany
2 LIMSI CNRS, Spoken Language Processing Group, Paris, France
{wiesler, richard, schlueter, ney}@cs.rwth-aachen.de

ABSTRACT

Log-linear models find a wide range of applications in pattern recognition. The training of log-linear models is a convex optimization problem. In this work, we compare the performance of stochastic and batch optimization algorithms. Stochastic algorithms are fast on large data sets but cannot be parallelized well. In our experiments on a broadcast conversations recognition task, stochastic methods yield competitive results after only a short training period, but when enough computational resources are spent on parallelization, batch algorithms are competitive with stochastic algorithms. We obtained slight improvements by using a stochastic second-order algorithm. Our best log-linear model outperforms the maximum likelihood trained Gaussian mixture model baseline despite being ten times smaller.

Index Terms— discriminative models, optimization, speech recognition

1. INTRODUCTION

Conventional speech recognition systems follow the generative statistical approach. Typically, hidden Markov models (HMMs) with Gaussian mixture models (GMMs) as emission models are used for modeling the joint probability of the spoken word sequence and the acoustic vector sequence. Such models can be trained efficiently with the expectation maximization (EM) algorithm. Their performance can be improved by a subsequent discriminative training, e.g. according to the minimum phone error (MPE) [1] criterion. In recent years, the interest in discriminative models for speech recognition has greatly increased.
Strong empirical results have been obtained with hierarchical discriminative models, e.g. [2]. Another line of research studies discriminative models with a flat structure [3, 4, 5, 6]. Our interest is in the use of log-linear models, which are attractive because they are statistical models with a convex training criterion. The convexity allows for finding the global optimum of the training criterion. In our recent work, we showed that performance competitive with discriminatively trained GMMs can be obtained by using log-linear models [7].

A drawback of all discriminative approaches is the high computational cost required in training. Therefore, the efficiency of optimization algorithms is an important research topic. In general, optimization methods for machine learning can be subdivided into two categories: batch algorithms and stochastic algorithms. In batch algorithms, the statistics which are used for updating the model are computed on the full dataset. In stochastic algorithms, only a small random subset is used. Stochastic algorithms are very promising on large and redundant datasets. However, batch algorithms can be accelerated strongly by using second-order information. This is in contrast to the most widely used stochastic optimization algorithm, stochastic gradient descent (SGD). Further, batch algorithms can be parallelized straightforwardly. Stochastic algorithms are widely used for training hierarchical models. For convex optimization, typically batch algorithms are employed.

This work was partly realized under the Quaero Programme, funded by OSEO, French State agency for innovation. The research leading to these results has received funding from the European Union Seventh Framework Programme EU-Bridge (FP7/2007-2013) under grant agreement N287658. H. Ney was partially supported by a senior chair award from DIGITEO, a French research cluster in Ile-de-France.
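The batch/stochastic distinction can be made concrete with a minimal sketch (illustrative Python, not the paper's implementation; the least-squares loss, data, learning rate, and mini-batch size are hypothetical choices for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical linear-regression dataset: 1000 samples, 5 features.
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(1000, 5))
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(w, Xb, yb):
    """Gradient of the mean squared error on the (mini-)batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w_batch = np.zeros(5)
w_sgd = np.zeros(5)
lr = 0.05
for t in range(200):
    # Batch step: statistics computed on the full dataset.
    w_batch -= lr * gradient(w_batch, X, y)
    # Stochastic step: only a small random subset (mini-batch) is used.
    idx = rng.choice(len(y), size=32, replace=False)
    w_sgd -= lr * gradient(w_sgd, X[idx], y[idx])
```

Each batch step touches all 1000 samples (and is trivially parallelizable across the data), while each stochastic step touches only 32, so the stochastic iterate makes progress after seeing far less data.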
However, in recent years, stochastic algorithms have gained a lot of attention for convex models [8, 9, 10, 11]. In this paper, we investigate whether stochastic algorithms are beneficial for optimizing log-linear models. Furthermore, we compare SGD with several stochastic algorithms that make use of second-order information. In addition, we compare the numerical robustness of stochastic and batch algorithms. Experiments are performed on a challenging English broadcast conversation task.

2. MODEL AND TRAINING CRITERION

Log-linear models are used to model class-posterior probabilities. Let X ⊂ R^D denote the observation space and C = {1, ..., C} a set of classes. A log-linear model is of the form

p_\Lambda(c|x) = \frac{\exp\left(\sum_{d=1}^{D} \lambda_{c,d} x_d\right)}{\sum_{c' \in C} \exp\left(\sum_{d=1}^{D} \lambda_{c',d} x_d\right)},    (1)
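For concreteness, Eq. (1) is a softmax over per-class linear scores and can be evaluated as follows (a minimal sketch; representing the parameters as a C×D matrix `Lambda` is an illustrative layout, not prescribed by the paper):

```python
import numpy as np

def log_linear_posterior(Lambda, x):
    """Class posteriors p_Lambda(c|x) of Eq. (1).

    Lambda: array of shape (C, D), row c holds the weights lambda_{c,d}.
    x:      observation vector of shape (D,).
    Scores are shifted by their maximum before exponentiating; this
    cancels in the ratio and avoids overflow in exp().
    """
    scores = Lambda @ x          # one linear score per class c
    scores = scores - scores.max()
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()
```

By construction the returned posteriors are nonnegative and sum to one over the classes.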