Sufficient Conditions for the Convergence of the Shannon Differential Entropy

Jorge F. Silva, Department of Electrical Engineering, Universidad de Chile, josilva@ing.uchile.cl
Patricio Parada, Department of Electrical Engineering, Universidad de Chile, pparada@ing.uchile.cl

Abstract— This work revisits and extends results concerning the convergence of the Shannon differential entropy. Concrete connections with the convergence of probability measures in the sense of total variation and of (direct and reverse) information divergence are established. In particular, under uniform boundedness conditions on the sequence of probability measures, the results stipulate that convergence in information divergence is sufficient to guarantee the convergence of the differential entropy functional.

I. INTRODUCTION

It is well known that the Shannon entropy in the finite alphabet case is a continuous functional on the space of distributions with respect to the total variational distance [1]. In fact, this was one of the basic requirements considered by Shannon to define this measure [2]. Surprisingly, the continuity does not hold if we move from a finite alphabet to a countably infinite alphabet, as recently pointed out by Ho et al. [3], [4]. In the continuous alphabet case, the discontinuity is known to be an issue when considering information measures like the Shannon differential entropy [5], [6], [7], [8].

In statistical learning, this discontinuity implies that results from distribution (or density) estimation do not carry over to the problem of estimating Shannon information measures. For example, Gyorfi et al. [6] found that extra conditions are needed to make plug-in histogram-based estimates consistent for the differential entropy, on top of the conditions needed to estimate the underlying density consistently in total variation (a minimal sketch of such a plug-in estimate is given below). Silva et al. [8], [9] found similar results when working with data-dependent partitions in the context of mutual information (MI) and Kullback-Leibler divergence (KLD) estimation. In particular, they found that stronger conditions are needed for estimating MI and KLD than those required for a consistent estimation of the underlying density in the total variational distance sense [10].

Consequently, to characterize a concrete connection between the topics of density estimation and information measure estimation, it is crucial to understand the conditions that guarantee convergence of the Shannon information measures [5]. Such results would provide cross-fertilization between these two important lines of research, which have mostly been developed as independent tracks in the past.

Motivated by this, the present work studies the Shannon differential entropy in terms of its convergence properties. In this direction, Piera et al. [5] have recently derived a number of conditions on a sequence of probability measures $\{P_n, n \in \mathbb{N}\}$ and a limiting distribution $P$ that guarantee $\lim_{n \to \infty} H(P_n) = H(P)$. In this work, we revisit, refine and extend these convergence results. In particular, we derive concrete relationships between convergence in (reverse and direct) I-divergence and the convergence of the Shannon differential entropy. These relationships are obtained in a number of settings, ranging from stronger to weaker conditions on the limiting distribution and, respectively, from weaker to stronger conditions on the way the sequence converges to $P$. In particular, the results show concrete scenarios where weak convergence, convergence in total variation, and convergence in I-divergence each suffice to guarantee the convergence of the differential entropy.
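To make the notion of a plug-in histogram-based estimate concrete, the following is a minimal Python sketch (our own illustration, not the construction analyzed in [6]): the density is estimated with a fixed-bin-width histogram and the differential entropy of that estimate is evaluated. The one-dimensional setting and the fixed bin width `bin_width` are simplifying assumptions.

```python
import numpy as np

def histogram_entropy(samples, bin_width):
    """Plug-in histogram estimate of the differential entropy (in nats).

    Builds a histogram density estimate with bins of width `bin_width`
    and returns the entropy of that estimate:
        H_hat = - sum_j p_j * log(p_j / h),
    where p_j is the empirical mass of bin j and h is the bin width.
    """
    samples = np.asarray(samples, dtype=float)
    lo, hi = samples.min(), samples.max()
    # Cover the sample range with bins of the requested width.
    n_bins = max(1, int(np.ceil((hi - lo) / bin_width)))
    counts, _ = np.histogram(samples, bins=n_bins,
                             range=(lo, lo + n_bins * bin_width))
    p = counts / counts.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return -np.sum(p * np.log(p / bin_width))

# Example: standard Gaussian samples; the true differential entropy is
# 0.5 * log(2 * pi * e) ~ 1.4189 nats.
rng = np.random.default_rng(0)
print(histogram_entropy(rng.standard_normal(100_000), bin_width=0.05))
```

The convergence of such an estimate to the true density (in total variation) does not by itself guarantee convergence of the estimated entropy, which is the kind of gap the results discussed above address.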
The rest of the paper is organized as follows. We start with some preliminary material in Section II. Section III presents the main results, and we finish in Section IV with some final remarks. Some of the proofs are presented in the Appendix.

II. PRELIMINARIES

Let $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ denote the standard $d$-dimensional Euclidean space equipped with the Borel sigma field [11]. Let $X \in \mathcal{B}(\mathbb{R}^d)$ be a Polish subspace of $\mathbb{R}^d$. For this space, let $\mathcal{P}(X)$ be the collection of probability measures on $(X, \mathcal{B}(X))$, and let $AC(X) \subset \mathcal{P}(X)$ denote the set of probability measures absolutely continuous with respect to the Lebesgue measure $\lambda$¹ [11]. For any $\mu \in AC(X)$, $\frac{\partial \mu}{\partial \lambda}(x)$ denotes the Radon-Nikodym (RN) derivative of $\mu$ with respect to $\lambda$. In addition, let us denote by $AC^{+}(X)$ the collection of probability measures $\mu \in AC(X)$ for which $\frac{\partial \mu}{\partial \lambda}(x)$ is strictly positive, Lebesgue almost everywhere in $X$. Note that when $\mu \in AC^{+}(X)$, then $\mu$ and $\lambda$ are mutually absolutely continuous on $X$ and, consequently, $\frac{\partial \lambda}{\partial \mu}(x)$ is well defined and equal to $\left( \frac{\partial \mu}{\partial \lambda}(x) \right)^{-1}$ for Lebesgue-almost every point $x \in X$.

Let $\mathcal{M}(X)$ denote the space of measurable functions from $X$ to $\mathbb{R}$. Then, for every $\mu \in AC(X)$, let us define

$$
L^{1}(\partial \mu) = \left\{ f \in \mathcal{M}(X) : \int_X |f| \, \partial \mu < \infty \right\}, \qquad (1)
$$

the space of $\mu$-integrable functions, and consequently the $L^1$-norm of $f \in L^{1}(\partial \mu)$ as $\|f\|_{L^{1}(\partial \mu)} = \int_X |f| \, \partial \mu$. In addition,

¹A measure $\sigma$ is absolutely continuous with respect to a measure $\mu$, denoted by $\sigma \ll \mu$, if for every event $A$ with $\mu(A) = 0$ we have $\sigma(A) = 0$.
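As a concrete illustration of the classes $AC(X)$ and $AC^{+}(X)$ defined above (this worked example is ours, not part of the original text), take $X = \mathbb{R}$ and compare the standard Gaussian measure with the uniform measure on $[0,1]$:

```latex
% Illustrative example (ours): AC(R) versus AC^+(R) for X = \mathbb{R}.
% The standard Gaussian measure \mu has a strictly positive density, so
% \mu \in AC^+(\mathbb{R}), \mu and \lambda are mutually absolutely continuous, and
\frac{\partial\mu}{\partial\lambda}(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2} > 0,
\qquad
\frac{\partial\lambda}{\partial\mu}(x)
  = \left( \frac{\partial\mu}{\partial\lambda}(x) \right)^{-1}
  = \sqrt{2\pi}\, e^{x^{2}/2}.
% In contrast, the uniform measure \nu on [0,1] satisfies \nu \in AC(\mathbb{R}) but
% \nu \notin AC^+(\mathbb{R}), since \partial\nu/\partial\lambda vanishes outside [0,1];
% there \lambda is not absolutely continuous with respect to \nu, and
% \partial\lambda/\partial\nu is not defined.
```

This distinction is precisely why $\frac{\partial \lambda}{\partial \mu}$ is only invoked for members of $AC^{+}(X)$.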