arXiv:2112.15572v1 [stat.ME] 31 Dec 2021

Statistical Scalability and Approximate Inference in Distributed Computing Environments

Aritra Chakravorty, William S. Cleveland, and Patrick J. Wolfe
{chakrav0,wsc,patrick}@purdue.edu

January 3, 2022

Abstract

Harnessing distributed computing environments to build scalable inference algorithms for very large data sets is a core challenge across the broad mathematical sciences. Here we provide a theoretical framework to do so along with fully implemented examples of scalable algorithms with performance guarantees. We begin by formalizing the class of statistics which admit straightforward calculation in such environments through independent parallelization. We then show how to use such statistics to approximate arbitrary functional operators, thereby providing practitioners with a generic approximate inference procedure that does not require data to reside entirely in memory. We characterize the $L^2$ approximation properties of our approach, and then use it to treat two canonical examples that arise in large-scale statistical analyses: sample quantile calculation and local polynomial regression. A variety of avenues and extensions remain open for future work.

Keywords: Approximate inference, distributed computing, non-parametric estimation, parallel algorithms, statistical scalability

AMS subject classifications: 62R07, 62G08, 68T09, 68W10, 68W25