Entropic Determinants of Massive Matrices

Diego Granziol and Stephen Roberts

Abstract—The ability of many powerful machine learning algorithms to deal with large data sets without compromise is often hampered by computationally expensive linear algebra tasks, of which calculating the log determinant is a canonical example. In this paper we demonstrate the optimality of maximum entropy methods in approximating such calculations. We prove the equivalence between mean value constraints and sample expectations in the big-data limit, show that covariance matrix eigenvalue distributions can be completely defined by moment information, and show that the reduction in the self-entropy of a maximum entropy proposal distribution, achieved by adding more moments, also reduces the KL divergence between the proposal and the true eigenvalue distribution. We empirically verify our results on a variety of SuiteSparse matrices and establish best practices.

Index Terms—Maximum entropy methods, approximation methods, matrix theory, constrained optimization, noisy constraints, log determinants.

I. MOTIVATION

Scalability is one of the key challenges facing machine learning algorithms. In the era of large data sets, inference schemes are required to deliver optimal results within a constrained computational budget. Linear algebraic operations with high computational complexity pose a significant bottleneck to algorithmic scalability, and the log determinant of a matrix [1] falls firmly within this category of operations. The typical solution, a Cholesky decomposition [2] of a general n × n positive definite matrix A, entails O(n^3) time complexity and O(n^2) storage, which is infeasible for large matrices. We further find that, along with making multiple matrix copies, typical implementations of Cholesky decomposition require contiguous memory.
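For reference, the O(n^3) baseline above can be sketched in a few lines of numpy; this is our own illustrative snippet, not code from the paper. For a symmetric positive definite A with Cholesky factor A = L L^T, the log determinant is twice the sum of the logs of the diagonal of L:

```python
import numpy as np

def cholesky_logdet(A):
    """Exact log det of an SPD matrix via Cholesky: O(n^3) time, O(n^2) memory."""
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))

# Small SPD example where the answer is known: det = 1 * 2 * 4 = 8.
A = np.diag([1.0, 2.0, 4.0])
print(cholesky_logdet(A))  # ≈ log 8 ≈ 2.0794
```

This is the dense-factorization cost that the stochastic methods discussed below are designed to avoid.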
Consequently, the difficulty of calculating this term greatly hinders widespread use of the learning models in which it appears, including determinantal point processes [3], Gaussian processes [4], and graph problems [5].

II. CONTRIBUTIONS OF THIS PAPER

Recent work combining maximum entropy algorithms with stochastic trace estimates of moments displayed state-of-the-art performance on log determinant estimates with O(n^2) computational time on both randomly generated and sparse matrices [6], with results shown in Figure 1. In this paper we address and answer many open pedagogical and practical concerns, such as:

1) Why should we characterize an eigenvalue probability distribution by its moments? To what extent do they embody relevant information?
2) What is the equivalence between sample averages and mean value constraints? When are they identical?
3) Can we characterize an eigenvalue distribution better with more moment constraints? Why do the MaxEnt predictions in Figure 1 from [6] get worse beyond a certain number of included moments?
4) If a practitioner wants to use MaxEnt algorithms and stochastic trace estimates to generate an accurate log determinant estimate of a large matrix, how many samples and how many moments do they need to take?

[Fig. 1. Absolute relative error of log determinant calculations on a collection of SuiteSparse datasets using stochastic trace estimate input data. MaxEnt (black dots) substantially outperforms other methods; figure originally from [6].]

Machine Learning Research Group, University of Oxford

III. CAN MOMENTS FULLY DESCRIBE PROBABILITY DISTRIBUTIONS?

For a probability measure \mu having finite moments of all orders,

\alpha_k = \int_{-\infty}^{\infty} x^k \, \mu(dx),

if the power series \sum_k \alpha_k r^k / k! has a positive radius of convergence, then \mu is the only probability measure with the moments \alpha_1, \alpha_2, \ldots [7].
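Concretely, for a covariance matrix A with eigenvalues \lambda_1, \ldots, \lambda_n, the k-th raw moment of the eigenvalue distribution is tr(A^k)/n, which can be estimated with matrix-vector products only. The following is a hedged numpy sketch of such a Hutchinson-style estimator (function names and probe counts are our own, not taken from [6]); it relies on E[z^T A^k z] = tr(A^k) for probes z with i.i.d. Rademacher (±1) entries:

```python
import numpy as np

def moment_estimates(A, num_moments, num_probes, seed=0):
    """Estimate (1/n) tr(A^k) for k = 1..num_moments via Rademacher probes."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    moments = np.zeros(num_moments)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        v = z.copy()
        for k in range(num_moments):
            v = A @ v                    # after this line, v = A^{k+1} z
            moments[k] += z @ v          # accumulates z^T A^{k+1} z
    return moments / (num_probes * n)    # average over probes, normalize by n

# Diagonal test case: eigenvalues are 1, 2, 3, 4, so the first two raw
# moments of the spectrum are 10/4 = 2.5 and 30/4 = 7.5.
A = np.diag([1.0, 2.0, 3.0, 4.0])
print(moment_estimates(A, num_moments=2, num_probes=100))
```

Each additional moment costs only one extra matrix-vector product per probe, which is what keeps the overall scheme at O(n^2) (or O(nnz) for sparse matrices) rather than O(n^3).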
The proof essentially shows that such measures must share a characteristic function which, by the uniqueness theorem for characteristic functions, implies a unique measure. It rests on the result that the ratio of the n-th absolute moment to n! vanishes as n \to \infty, i.e.

\frac{1}{n!} \int |x|^n \, \mu(dx) \xrightarrow{n \to \infty} 0. \quad (1)

This means that in the n \to \infty limit the growth of the absolute moments \beta_n = \int |x|^n \, \mu(dx) is at most a factor of n per step, namely

\frac{\beta_n}{\beta_{n-1}} \leq n \quad \text{as } n \to \infty. \quad (2)

A. Application to Entropic Trace Estimation

Consider a random vector z with mean m and covariance \Sigma. Using the property of expectations of quadratic forms, the second-moment matrix is

E[z z^T] = \Sigma + m m^T = I, \quad (3)

arXiv:1709.02702v1 [stat.ML] 8 Sep 2017
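The identity E[z z^T] = \Sigma + m m^T = I in (3) is easy to check empirically; the snippet below is our own illustration (not from the paper), using zero-mean Rademacher probes, for which m = 0 and \Sigma = I, so the sample second-moment matrix should converge to the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
n, samples = 4, 200_000

# Rademacher probes: entries are i.i.d. +/-1, so m = 0 and Sigma = I.
Z = rng.choice([-1.0, 1.0], size=(samples, n))

# Empirical second-moment matrix (1/N) sum_i z_i z_i^T.
emp = Z.T @ Z / samples
print(np.round(emp, 2))  # close to the 4x4 identity
```

The diagonal entries are exactly 1 (since z_i^2 = 1), and the off-diagonal entries shrink at the usual O(1/sqrt(samples)) Monte Carlo rate; the same rate governs the noise in the stochastic trace estimates above.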