Accepted as a workshop contribution at ICLR 2015

VISUAL SCENE REPRESENTATIONS: SUFFICIENCY, MINIMALITY, INVARIANCE AND DEEP APPROXIMATION

Stefano Soatto*
University of California, Los Angeles
soatto@ucla.edu

Alessandro Chiuso
Università di Padova
chiuso@dei.unipd.it

ABSTRACT

Visual representations are defined as functions of visual data that approximate minimal sufficient statistics for a class of tasks, and are maximally invariant to nuisance variability. We derive analytical expressions for such representations and show that they are related to "feature descriptors" commonly in use in the computer vision community, as well as to convolutional architectures. This new interpretation draws connections to the classical theories of sampling, hypothesis testing and group invariance.

1 INTRODUCTION

We define an ideal visual representation as a function $\phi$ of past data $x^t$ that is minimally sufficient for answering questions about the scene $\theta$ given the future data $y$ it generates, and maximally invariant to nuisance transformations $g$ affecting the latter. Much of Computer Vision is about computing functions of images that are somehow "useful." The process can be stacked to yield a convolutional architecture Bruna & Mallat (2011) whose weights may be inferred from data Ranzato et al. (2007); LeCun (2012); Simonyan et al. (2014); Serre et al. (2007); Bouvrie et al. (2009); Susskind et al. (2011). Rather than proposing yet another representation and comparing it empirically, we derive a formal expression for ideal representations from basic principles of statistical decision theory: sufficiency, minimality, invariance. Then, we show how existing representations are related to an optimal one, highlighting the (often tacit) underlying assumptions. We present an alternate interpretation of pooling, in the context of classical sampling theory, that differs from other analyses Gong et al. (2014); Boureau et al. (2010).
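To make the notion of maximal invariance concrete, here is a toy sketch not taken from the paper: for 1-D signals under the nuisance group $G$ of cyclic shifts, picking a canonical representative of each orbit (here, the lexicographically smallest rotation) yields a maximal invariant. Two signals map to the same value if and only if they differ by an element of $G$, so the statistic discards nothing but the nuisance. The function name `canonical_rotation` is ours, for illustration only.

```python
def canonical_rotation(x):
    """Maximal invariant for cyclic shifts: the smallest rotation of x.

    Constant on each orbit of the shift group, and distinct across
    orbits, so it removes exactly the nuisance and nothing else.
    """
    n = len(x)
    return min(tuple(x[i:] + x[:i]) for i in range(n))

x = [3, 1, 4, 1, 5]
shifted = x[2:] + x[:2]  # same "scene", different nuisance g
assert canonical_rotation(x) == canonical_rotation(shifted)  # invariance
```

Real nuisances (viewpoint, illumination) act on images far less tidily than cyclic shifts act on lists, which is why the paper pursues approximate, task-dependent invariants instead of exact canonical representatives.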
2 THE SOA LIKELIHOOD

The data (images) $X$ are random variables with samples $x$; the model (scene) $\theta$ is an infinite-dimensional unknown parameter in the experiment $E = \{x, \theta, p_\theta(x)\}$; a visual decision is a partition of $\theta$; a statistic $T$ is a function of the sample; it is sufficient (of $x$ for $\theta$) if $X \mid T = t$ does not depend on $\theta$; it is minimal if it is a function of all other sufficient statistics. The likelihood function is $L(\theta; x) \doteq p_\theta(x)$. A nuisance $g \in G$ is an unknown parameter that is not of interest and yet appears in the likelihood, $p_{\theta,g}(x)$. If $L(\theta, g; x)$ is the joint likelihood, which for simplicity we indicate with $L(\theta, g)$, then

$$L(\theta) \doteq \max_{g \in G} L(\theta, g) = \max_{g \in G} p_{\theta,g}(x) \quad (1)$$

is the profile likelihood. If we treat $g$ as a random variable with known prior $dP(g)$ and define $p_\theta(x \mid g) \doteq p_{\theta,g}(x)$, then

$$L_G(\theta) \doteq \int_G p_\theta(x \mid g)\, dP(g) \quad (2)$$

is the marginalized likelihood.

* Also UCLA Technical Report CSD140023, November 12, 2014

arXiv:1411.7676v5 [cs.CV] 17 Apr 2015
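The contrast between (1) and (2) can be sketched numerically. The following is an illustrative example of ours, not the paper's model: a scalar observation $x = \theta + g + \text{noise}$ with Gaussian noise, a finite nuisance set standing in for $G$, and a uniform prior playing the role of $dP(g)$. Eq. (1) maximizes out the nuisance; Eq. (2) averages it out, so the integral becomes a sum.

```python
import math

def lik(theta, g, x, sigma=1.0):
    """p_{theta,g}(x): Gaussian density of x centered at theta + g."""
    z = (x - theta - g) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Finite stand-in for the nuisance group G (illustrative choice).
G = [-1.0, 0.0, 1.0]

def profile_lik(theta, x):
    """Eq. (1): max-out the nuisance g over G."""
    return max(lik(theta, g, x) for g in G)

def marginal_lik(theta, x):
    """Eq. (2): average over g under a uniform prior dP(g)."""
    return sum(lik(theta, g, x) for g in G) / len(G)
```

By construction the profile likelihood dominates the marginalized one pointwise, since a maximum is at least any average; which of the two is the right tool depends on whether a prior on the nuisance is actually available, a distinction the paper exploits.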