Data Streaming Algorithms for the Chi-Square Test

Emily Farrow, Junbo Li, Farhan Zaki, Ashwin Lall
Department of Mathematics and Computer Science
Denison University
Granville, OH, USA

Abstract: We present space-efficient algorithms for performing Pearson’s chi-square goodness-of-fit test in a streaming setting. The chi-square test is one of the most popular and widely used tests in statistics. The test does not assume a specific distribution and has one-sample and two-sample variants. Given a stream of data, the one-sample variant tests whether the stream is drawn from a fixed distribution, and the two-sample variant tests whether two data streams are drawn from the same or similar distributions. The chi-square test has strong advantages over similar tests. For example, it can be applied to both categorical and continuous data. The problem that we solve in this paper is how to compute the chi-square statistic without making any assumptions about the stream beforehand. We give rigorous proofs showing that it is possible to compute the chi-square statistic with high fidelity and an almost quadratic reduction in memory. We validate the performance and accuracy of our algorithms through extensive testing on both real and synthetic data sets.

Keywords: chi-square test, streaming algorithms

I. INTRODUCTION

Over the last few decades, modern computer and networking systems have created the ability to continuously generate enormous volumes of data at very high speed. This data is generated by sensors, systems, mobile devices, readers, and every digital process, and it arises at a global level from the functioning of almost every business, government operation, and social media exchange.
Analyzing and studying this data is important in forecasting weather patterns and natural disasters, preventing fraud, making scientific discoveries, predicting the stock market, tracking trends, making better business decisions, running web applications, and numerous other uses [3].

Big Data has become so large that it is infeasible to translate, store, and process. This exponential growth of data is unavoidable in the modern age. However, efficient technologies to store the generated information without massive down-sampling have not been developed. The inability of fast memory capacity to keep up with the size of this data has become one of the most pressing challenges of Big Data.

There are a number of existing statistical techniques that can be used to analyze Big Data. These statistical tests provide a method to determine the validity of a hypothesis based on observation and evidence, and can be used to support or refute conjectures. Perhaps the best known and most widely used among these tests is Pearson’s chi-square goodness-of-fit test (henceforth referred to as the chi-square test in this paper). This test has the advantages of being non-parametric, i.e., it is not tied to any specific distribution such as the Gaussian, and of being applicable to continuous as well as categorical data. It is therefore surprising that no sub-linear memory algorithms for this test have been proposed in the literature. In this paper, we show how to compute the chi-square test concisely yet accurately in order to check whether a particular stream of continuous data comes from a fixed known distribution, whether two streams of continuous data come from the same source, and whether two categorical data streams have a similar underlying distribution.

In the streaming model, the input is presented as a stream of updates, and the challenge is to answer some question about the stream as it goes by in a single pass, without storing all of it.
There has been considerable research in this model (see [1], [14] for surveys on the topic). However, most of the work in this area has been on fundamental operators, such as frequency moments, cardinality, and quantiles. While these operators are important, they are not as useful for practitioners who may not know how to apply them. This is the first work to perform the chi-square tests, which are already widely used, in this framework.

To perform the continuous version of the chi-square test, we need to partition the data into bins, where the expected frequency of each bin is compared to the observed frequency to calculate the test statistic. There is a requirement that each bin has an expected frequency of at least five. In our setting, the distribution of the stream is unknown beforehand; therefore, picking bins without any foreknowledge of the distribution can lead to one or more bins having fewer than five samples, violating the above requirement. An alternate approach is to sample an initial section of the stream and choose bins based upon this sample. The downside is that we need to make an independent and identically distributed (i.i.d.) assumption about the stream, which does not always hold. For example, network traffic data is notorious for being very bursty, exhibiting non-stationary distributions. The main challenge overcome by the algorithms in our paper is that we are able to compute the chi-square statistic in each of these cases without any prior knowledge about the distribution being measured and while making few assumptions about the stream.
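
The binning step described above can be illustrated with the textbook one-sample computation, in which the statistic is the sum over bins of (observed - expected)^2 / expected. The following is a minimal sketch and not the space-efficient algorithm of this paper: it assumes the bin boundaries are fixed in advance and keeps one counter per bin. The names `chi_square_statistic`, `normal_cdf`, and `edges` are hypothetical and chosen only for illustration.

```python
import math
import random
from bisect import bisect_right


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a Gaussian, used here as an example hypothesized distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def chi_square_statistic(stream, edges, cdf):
    """Textbook one-sample chi-square statistic with bins fixed in advance.

    Assumption (not from the paper): `edges` is a sorted list of interior
    bin boundaries chosen before the stream is seen, so there are
    len(edges) + 1 bins covering the whole real line.
    """
    counts = [0] * (len(edges) + 1)
    n = 0
    for x in stream:  # single pass; only the per-bin counters are stored
        counts[bisect_right(edges, x)] += 1
        n += 1

    # Expected frequency of each bin under the hypothesized distribution.
    cum = [0.0] + [cdf(e) for e in edges] + [1.0]
    expected = [n * (cum[i + 1] - cum[i]) for i in range(len(counts))]

    # The rule of thumb quoted in the text: at least five expected per bin.
    if min(expected) < 5:
        raise ValueError("every bin needs an expected frequency of at least 5")

    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    return stat, counts, expected


# Example: 1000 synthetic Gaussian samples tested against a standard normal.
rng = random.Random(0)
edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
sample = (rng.gauss(0.0, 1.0) for _ in range(1000))
stat, observed, expected = chi_square_statistic(sample, edges, normal_cdf)
```

In this sketch the check on `expected` fails whenever `edges` is chosen poorly relative to the hypothesized distribution or the stream length, which is the difficulty with picking bins blindly that the paragraph above raises.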