IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 49, NO. 10, OCTOBER 2004

Single-Run Gradient Estimation Via Measure-Valued Differentiation

Bernd Heidergott and Arie Hordijk

Abstract—We show how single-run measure-valued differentiation gradient estimators can be obtained. The key idea is to apply a change of measure a posteriori to the mathematical analysis of the derivative. From the point of view of the likelihood ratio method, we show that likelihood-ratio-type gradient estimators can be applied in situations where the mathematical conditions needed for applying a likelihood ratio analysis are not met.

Index Terms—Gradient estimation, likelihood ratios, measure-valued differentiation, Markov chains.

I. INTRODUCTION

Many real-world systems in the areas of telecommunication, manufacturing, and transportation can be modeled as (general state-space) Markov chains. Deriving efficient gradient estimators for Markov chains is important for the stochastic optimization of such systems. Measure-valued differentiation (MVD) is a mathematical framework for computing derivatives of, and gradient estimators for, Markov kernels. Specifically, in [7] a product rule of MVD for general state-space Markov chains has been established, and in [6] sufficient conditions are provided for the unique stationary distribution of a general state-space Markov chain to have a measure-valued derivative. Derivative formulas obtained via the MVD approach can be translated into unbiased gradient estimators. Unfortunately, the MVD estimator, as for example introduced in [8], involves splitting a sample path into two subpaths, called the positive and negative version of the Markov chain, respectively, or phantoms. This splitting has to be applied at every transition of the process, which causes a considerable computational burden when running an MVD estimator.
For geometrically ergodic chains, combining the splitting technique with an appropriate coupling leads to geometrically bounded cycle lengths of the gradient estimator for stationary characteristics; see [6].

The likelihood ratio (respectively, the score function) approach to gradient estimation for general state-space Markov chains is based on differentiating densities rather than measures; see [4], [3], [9], and [10]. While this approach is in general more restrictive than MVD, it enjoys the nice property that likelihood ratio gradient estimators can be evaluated by observing a single sample path: there is no splitting involved. This reduces the computational effort of estimating a gradient via a likelihood ratio estimator compared to an MVD estimator. It is widely believed that this distinction between the estimators, "single-run based" on the one side and "splitting sample paths" on the other, is generic to the two methods. In this note we show that this is not the case. Specifically, we show that any MVD formula can be interpreted in a likelihood ratio manner by introducing appropriate Radon–Nikodym derivatives, which are obtained via a convex combination [to be defined in (6)] of the positive and negative part of the MVD derivative of the Markov kernel.

Manuscript received July 7, 2003; revised June 7, 2004. Recommended by Associate Editor Y. Wardi.
B. Heidergott is with the Vrije Universiteit Amsterdam and Tinbergen Institute, 1081 HV Amsterdam, The Netherlands (e-mail: bheidergott@feweb.vu.nl).
A. Hordijk is with Leiden University, Mathematical Institute, 2300 RA Leiden, The Netherlands (e-mail: hordijk@math.leidenuniv.nl).
Digital Object Identifier 10.1109/TAC.2004.835588

In this note, we address representations for the derivative of a general state-space Markov chain. Elaborating on appropriate product rules of differentiation, the extension of our results to finite-horizon experiments is straightforward.
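The single-run character of the likelihood ratio approach can be illustrated with a toy example of our own (a single Exp($\theta$) draw rather than a Markov chain): with density $f_\theta(x)=\theta e^{-\theta x}$, the score is $\frac{d}{d\theta}\log f_\theta(x)=1/\theta-x$, so $\frac{d}{d\theta}\mathbb E_\theta[g(X)]=\mathbb E_\theta[g(X)(1/\theta-X)]$ is estimated from one batch of samples, with no path splitting. A minimal sketch, assuming Python with NumPy:

```python
import numpy as np

# Likelihood ratio (score function) estimate of d/dtheta E[X] for X ~ Exp(theta).
# Analytic value: d/dtheta (1/theta) = -1/theta**2.
theta = 2.0
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0 / theta, size=200_000)

score = 1.0 / theta - x           # d/dtheta log(theta * exp(-theta * x))
lr_estimate = np.mean(x * score)  # one batch of samples, no splitting

print(lr_estimate)  # close to -1/theta**2 = -0.25
```

The same samples used to estimate $\mathbb E_\theta[g(X)]$ also deliver the derivative estimate; this is the "single run" property discussed above.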
Indeed, for likelihood ratios this extension readily follows from the product rule of differentiation for real-valued mappings, and for MVD we resort to the product rule of MVD established in [6]. Moreover, derivative formulas for the unique stationary distribution of a Markov chain are given via likelihood ratios in [4] and via MVD in [6].

The note is organized as follows. Section II introduces the basic concepts of MVD. Section III presents the likelihood ratio approach. In Section IV, the combination of both approaches is discussed. Eventually, we address the problem of finding the best possible way of constructing a single-run MVD estimator.

II. MVD

This section provides a brief review of the basic concepts of measure-valued differentiation. For details we refer to [5]–[8]. Let $(S,\mathcal S)$ be a Polish measurable space. Let $\mathcal M(S,\mathcal S)$ denote the set of finite (signed) measures on $(S,\mathcal S)$ and $\mathcal M_1(S,\mathcal S)$ that of probability measures on $(S,\mathcal S)$. A mapping $Q$ on $S\times\mathcal S$ is called a (homogeneous) transition kernel on $(S,\mathcal S)$ if a) $Q(s;\cdot)\in\mathcal M(S,\mathcal S)$ for all $s\in S$, and b) $Q(\cdot;A)$ is measurable for all $A\in\mathcal S$. If, in condition a), $\mathcal M(S,\mathcal S)$ can be replaced by $\mathcal M_1(S,\mathcal S)$, then $Q$ is called a Markov kernel on $(S,\mathcal S)$. Denote the set of transition kernels on $(S,\mathcal S)$ by $\mathcal K$ and the set of Markov kernels on $(S,\mathcal S)$ by $\mathcal K_1$. We consider a family of Markov kernels $P_\theta$ on $(S,\mathcal S)$, with $\theta\in\Theta$. Let $\mathcal D$ denote the set of measurable mappings $g:S\to\mathbb R$ such that $\int g(u)\,P_\theta(s;du)$ is finite for all $s\in S$ and $\theta\in\Theta$. For $g\in\mathcal D$, we call $P_\theta$ differentiable at $\theta$ with respect to $\mathcal D$, or $\mathcal D$-differentiable for short, if for any $s\in S$ a finite signed measure $P'_\theta(s;\cdot)$ exists such that for any $g\in\mathcal D$

$$\frac{d}{d\theta}\int g(u)\,P_\theta(s;du)=\int g(u)\,P'_\theta(s;du). \qquad (1)$$

Notice that $P'_\theta$ fails to be a Markov kernel, which poses the problem of sampling from $P'_\theta(s;\cdot)$. In order to facilitate simulation, we introduce the notion of a $\mathcal D$-derivative, which extends the concept of a weak derivative, defined only for bounded continuous performance mappings. Let $P_\theta$ be $\mathcal D$-differentiable at $\theta$. Any triple $(c_\theta,P^+_\theta,P^-_\theta)$, with $P^+_\theta,P^-_\theta\in\mathcal K_1$ and $c_\theta$ a measurable mapping from $S$ to $\mathbb R$, that satisfies for all $g\in\mathcal D$

$$\int g(u)\,P'_\theta(s;du)=c_\theta(s)\left(\int g(u)\,P^+_\theta(s;du)-\int g(u)\,P^-_\theta(s;du)\right) \qquad (2)$$

is called a $\mathcal D$-derivative of $P_\theta$. The kernel $P^+_\theta$ is called the positive part and the kernel $P^-_\theta$ is called the negative part of $P'_\theta$, respectively.
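For a concrete scalar (non-Markov) instance of the representation (2), not taken from the note: for the exponential distribution with density $\theta e^{-\theta x}$, differentiating the density gives $\frac{d}{d\theta}f_\theta(x)=\frac{1}{\theta}\left(\theta e^{-\theta x}-\theta^2 x e^{-\theta x}\right)$, i.e., the triple $(c_\theta,P^+_\theta,P^-_\theta)=(1/\theta,\,\mathrm{Exp}(\theta),\,\Gamma(2,\theta))$, the negative part being the Gamma(2, $\theta$) distribution. A quick Monte Carlo check, assuming Python with NumPy:

```python
import numpy as np

# D-derivative triple for Exp(theta): c = 1/theta, P+ = Exp(theta), P- = Gamma(2, theta).
# Check d/dtheta E[g(X)] = c * (E[g(X+)] - E[g(X-)]) for g(x) = x.
theta = 2.0
n = 200_000
rng = np.random.default_rng(0)

x_plus = rng.exponential(scale=1.0 / theta, size=n)        # samples from the positive part
x_minus = rng.gamma(shape=2.0, scale=1.0 / theta, size=n)  # samples from the negative part

mvd_estimate = (1.0 / theta) * (x_plus.mean() - x_minus.mean())

print(mvd_estimate)  # close to d/dtheta (1/theta) = -1/theta**2 = -0.25
```

Note that the estimator requires samples from two distributions; applied at every transition of a Markov chain, this is exactly the path-splitting burden described in the Introduction.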
Sufficient conditions for the existence of a $\mathcal D$-derivative can be found in [5].

Example 1: Let $S\subseteq\mathbb R$ and let $\mathcal S$ denote the Borel field on $S$. For $\theta\in\Theta$, let $X_\theta$ denote the Markov chain on $(S,\mathcal S)$ with transition probability