Accepted as a workshop contribution at ICLR 2015

VISUAL SCENE REPRESENTATIONS: SUFFICIENCY, MINIMALITY, INVARIANCE AND DEEP APPROXIMATION

Stefano Soatto*
University of California, Los Angeles
soatto@ucla.edu

Alessandro Chiuso
Università di Padova
chiuso@dei.unipd.it

ABSTRACT

Visual representations are defined as functions of visual data that approximate minimal sufficient statistics for a class of tasks, and are maximally invariant to nuisance variability. We derive analytical expressions for such representations and show that they are related to "feature descriptors" commonly in use in the computer vision community, as well as to convolutional architectures. This new interpretation draws connections to the classical theories of sampling, hypothesis testing and group invariance.

1 INTRODUCTION

We define an ideal visual representation as a function $\phi$ of past data $x^t$ that is minimally sufficient for answering questions about the scene $\theta$ given the future data $y$ it generates, and maximally invariant to nuisance transformations $g$ affecting the latter. Much of Computer Vision is about computing functions of images that are somehow "useful." The process can be stacked to yield a convolutional architecture Bruna & Mallat (2011) whose weights may be inferred from data Ranzato et al. (2007); LeCun (2012); Simonyan et al. (2014); Serre et al. (2007); Bouvrie et al. (2009); Susskind et al. (2011). Rather than proposing yet another representation and comparing it empirically, we derive a formal expression for ideal representations from basic principles of statistical decision theory: sufficiency, minimality, invariance. Then, we show how existing representations are related to an optimal one, highlighting the (often tacit) underlying assumptions. We present an alternate interpretation of pooling, in the context of classical sampling theory, that differs from other analyses Gong et al. (2014); Boureau et al. (2010).
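To make the notion of maximal invariance concrete, here is a toy sketch not taken from the paper: for 1-D signals under the nuisance group $G$ of cyclic shifts, picking a canonical representative of each orbit (here, the lexicographically smallest rotation) yields a maximal invariant. Two signals map to the same value if and only if they differ by an element of $G$, so the statistic discards nothing but the nuisance. The function name `canonical_rotation` is ours, for illustration only.

```python
def canonical_rotation(x):
    """Maximal invariant for cyclic shifts: the smallest rotation of x.

    Constant on each orbit of the shift group, and distinct across
    orbits, so it removes exactly the nuisance and nothing else.
    """
    n = len(x)
    return min(tuple(x[i:] + x[:i]) for i in range(n))

x = [3, 1, 4, 1, 5]
shifted = x[2:] + x[:2]  # same "scene", different nuisance g
assert canonical_rotation(x) == canonical_rotation(shifted)  # invariance
```

Real nuisances (viewpoint, illumination) act on images far less tidily than cyclic shifts act on lists, which is why the paper pursues approximate, task-dependent invariants instead of exact canonical representatives.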
2 THE SOA LIKELIHOOD

The data (images) $X$ are random variables with samples $x$; the model (scene) $\theta$ is an infinite-dimensional unknown parameter in the experiment $E = \{x, \theta, p_\theta(x)\}$; a visual decision is a partition of $\theta$; a statistic $T$ is a function of the sample; it is sufficient (of $x$ for $\theta$) if $X \mid T = t$ does not depend on $\theta$; it is minimal if it is a function of all other sufficient statistics. The likelihood function is $L(\theta; x) \doteq p_\theta(x)$. A nuisance $g \in G$ is an unknown parameter that is not of interest and yet appears in the likelihood, $p_{\theta,g}(x)$. If $L(\theta, g; x)$ is the joint likelihood, which for simplicity we indicate with $L(\theta, g)$, then

$$L(\theta) \doteq \max_{g \in G} L(\theta, g) = \max_{g \in G} p_{\theta,g}(x) \quad (1)$$

is the profile likelihood. If we treat $g$ as a random variable with known prior $dP(g)$ and define $p_\theta(x \mid g) \doteq p_{\theta,g}(x)$, then

$$L_G(\theta) \doteq \int_G p_\theta(x \mid g)\, dP(g) \quad (2)$$

is the marginalized likelihood.

* Also UCLA Technical Report CSD140023, November 12, 2014

arXiv:1411.7676v5 [cs.CV] 17 Apr 2015
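The contrast between (1) and (2) can be sketched numerically. The following is an illustrative example of ours, not the paper's model: a scalar observation $x = \theta + g + \text{noise}$ with Gaussian noise, a finite nuisance set standing in for $G$, and a uniform prior playing the role of $dP(g)$. Eq. (1) maximizes out the nuisance; Eq. (2) averages it out, so the integral becomes a sum.

```python
import math

def lik(theta, g, x, sigma=1.0):
    """p_{theta,g}(x): Gaussian density of x centered at theta + g."""
    z = (x - theta - g) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Finite stand-in for the nuisance group G (illustrative choice).
G = [-1.0, 0.0, 1.0]

def profile_lik(theta, x):
    """Eq. (1): max-out the nuisance g over G."""
    return max(lik(theta, g, x) for g in G)

def marginal_lik(theta, x):
    """Eq. (2): average over g under a uniform prior dP(g)."""
    return sum(lik(theta, g, x) for g in G) / len(G)
```

By construction the profile likelihood dominates the marginalized one pointwise, since a maximum is at least any average; which of the two is the right tool depends on whether a prior on the nuisance is actually available, a distinction the paper exploits.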