Variational Representations of Non-Gaussian Priors

J. A. Palmer, K. Kreutz-Delgado, D. P. Wipf, and B. D. Rao
Department of Electrical and Computer Engineering
University of California San Diego, La Jolla, CA 92093
{japalmer,kreutz,dwipf,brao}@ece.ucsd.edu

Abstract

Variational representations can be used to accommodate non-gaussianity while still maintaining some of the tractability of Gaussian models. Such representations, however, require super-gaussianity and cannot be used to represent sub-gaussian densities. In the literature, variational methods are often conflated, and the criteria for their application do not seem to be well understood. Here we distinguish among the various theories, give explicit criteria, and formulate general algorithms.

1 Introduction

In Bayesian probabilistic models, Gaussian priors are commonly used because they lead to tractable analysis, yielding closed-form solutions for integrals and extrema. Gaussian priors have limitations, however, and are unsuitable in many important contexts. For example, Gaussian densities cannot be used to represent heavy-tailed, or "sparse," priors, which may be used to model infrequently firing neurons or impulsive noise. Non-gaussian priors are also important in source separation, where one must model the actual distribution of the independent sources, and non-gaussianity is indeed necessary for separability [23]. One approach to modelling non-gaussian random variables, while still maintaining analytic tractability, is through the use of "variational methods," which represent non-gaussian densities "variationally" in terms of Gaussians.

We can distinguish three related, yet distinct, uses of variational representations in probabilistic artificial intelligence: (1) Convex bounding [43, 27, 24, 26, 8, 20].
This use of the term "variational" derives from the representation of a convex function as the supremum over a set of affine functions,

    f(x) = sup_φ { φx − f*(φ) },        (1)

where f* is the convex conjugate of f and the variable φ is referred to as a "variational" parameter. (2) Ensemble learning [34, 32, 5] and Variational Bayes [4]. Here, instead of variational parameters, one has a variational distribution, e.g. a factorial "mean-field" distribution. In this paper we are concerned with variational representations of non-gaussian priors, but we discuss ensemble methods as well for completeness. (3) Hyperpriors [35, 45]. The hyperprior representation is similar to the convex bounding representation in that "variational" parameters are used, but rather than relying on a convexity relationship, hyperprior methods rely on an integral representation in which the variational parameters are variance or scale parameters.

While variational methods have been used with increasing frequency in recent years, and some tutorial works have been written [28, 25], the literature tends to concentrate more on the use of specific methods, or on surveying various techniques as "tools in a toolbox." In this paper we attempt a more general explication of the relationship of these methods to one another, and give explicit criteria for their application.

We take the concept of a variational representation of a function generally to mean the representation of the function as some kind of marginalization over another function, usually one that is easier to work with. In our case, we will be representing non-gaussian probability densities as marginalizations over Gaussian density functions. We consider two types of variational representation, which we shall call convex type and integral type.
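As a concrete instance of the conjugate representation (1) (a standard convex-analysis example, not taken from this paper), consider the self-conjugate function f(x) = x²/2; taking the supremum over affine minorants recovers f exactly:

```latex
% Worked example of the dual representation (1) for f(x) = x^2/2.
% The conjugate is computed first; substituting it back recovers f.
f^*(\phi) = \sup_x \Big( \phi x - \tfrac{x^2}{2} \Big) = \tfrac{\phi^2}{2},
\qquad
\sup_\phi \Big( \phi x - f^*(\phi) \Big)
  = \sup_\phi \Big( \phi x - \tfrac{\phi^2}{2} \Big)
  = \tfrac{x^2}{2} = f(x).
```

The supremum in the second expression is attained at φ = x, illustrating how the optimal variational parameter depends on the point x at which the bound is tightened.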
In the convex type of variational representation, the density is represented as a supremum over Gaussian functions,

    p(x) = sup_ξ N(x; 0, ξ^{-1}) ϕ(ξ).        (2)

This type of representation has been used primarily for estimation in graphical models and belief networks [43, 27, 24, 28]. It was applied to kernel machine es-
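A representation of this form can be checked numerically. The sketch below (an illustration, not code from the paper) verifies that the super-Gaussian Laplace density p(x) = (1/2) exp(−|x|) is a pointwise supremum over scaled Gaussians; for simplicity it parameterizes the Gaussian directly by its variance ξ, and the scaling function ϕ(ξ) = (1/2) √(2πξ) exp(−ξ/2) is an assumed choice that makes the supremum, attained at ξ = |x|, reproduce the Laplace density exactly.

```python
# Numerical check: the Laplace density as a supremum over Gaussians,
#   p(x) = sup_xi  N(x; 0, xi) * phi(xi),
# with the (assumed) scaling function phi(xi) = (1/2) sqrt(2*pi*xi) exp(-xi/2).
# The Gaussian normalizer cancels against phi, and maximizing the exponent
# -x^2/(2*xi) - xi/2 over xi gives xi = |x|, yielding (1/2) exp(-|x|).
import numpy as np

def laplace_sup(x, xi=np.linspace(1e-4, 10.0, 200_000)):
    """Approximate sup_xi N(x; 0, xi) * phi(xi) on a fine grid of variances."""
    gauss = np.exp(-x**2 / (2.0 * xi)) / np.sqrt(2.0 * np.pi * xi)
    phi = 0.5 * np.sqrt(2.0 * np.pi * xi) * np.exp(-xi / 2.0)
    return np.max(gauss * phi)

for x in (0.5, 1.0, 2.0):
    exact = 0.5 * np.exp(-abs(x))
    print(f"x = {x}: sup = {laplace_sup(x):.6f}, Laplace density = {exact:.6f}")
```

On a sufficiently fine grid the supremum agrees with (1/2) exp(−|x|) to several decimal places, which is the sense in which the Gaussian family, suitably scaled, "envelopes" a super-Gaussian density from below.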