arXiv:1202.3399v2 [cs.DB] 21 Feb 2012

Measuring the achievable error of query sets under differential privacy

Chao Li, Gerome Miklau
University of Massachusetts Amherst, Massachusetts, USA
{chaoli, miklau}@cs.umass.edu

Abstract

A common goal of privacy research is to release synthetic data that satisfies a formal privacy guarantee and can be used by an analyst in place of the original data. To achieve reasonable accuracy, a synthetic data set must be tuned to support a specified set of queries accurately, sacrificing fidelity for other queries. This work considers methods for producing synthetic data under differential privacy and investigates what makes a set of queries “easy” or “hard” to answer. We consider answering sets of linear counting queries using the matrix mechanism [15], a recent differentially private mechanism that can reduce error by adding complex correlated noise adapted to a specified workload. Our main result is a novel lower bound on the minimum total error required to simultaneously release answers to a set of workload queries. The bound reveals that the hardness of a query workload is related to the spectral properties of the workload when it is represented in matrix form. The bound is tight and, because it satisfies important soundness criteria, it can serve as a reliable numerical measure of the “error complexity” of a workload.

1 Introduction

Differential privacy [8] is a rigorous privacy standard offering participants in a data set the appealing guarantee that released query answers will be nearly indistinguishable whether or not their data is included. The earliest methods for achieving differential privacy were interactive: an analyst submits a query to the server and receives a noisy query answer. Further queries may be submitted, but increasing noise will be added and the server may eventually refuse to answer subsequent queries.
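To make the setting of linear counting queries and the matrix mechanism concrete, the following sketch answers a small workload under the Laplace mechanism, either directly or through a strategy matrix. The workload (prefix sums over 8 cells), the identity strategy, and all sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data vector: counts in 8 cells of a histogram.
x = rng.integers(0, 20, size=8).astype(float)

# Workload W: all prefix-sum (range) queries over the 8 cells,
# represented as a matrix so each row is one linear counting query.
W = np.tril(np.ones((8, 8)))

eps = 1.0

def laplace_answer(A, x, eps):
    """Answer the queries in A under eps-differential privacy by adding
    Laplace noise scaled to A's L1 sensitivity (max column L1 norm)."""
    sensitivity = np.abs(A).sum(axis=0).max()
    noise = rng.laplace(scale=sensitivity / eps, size=A.shape[0])
    return A @ x + noise

# Baseline: answer the workload W directly (sensitivity 8 here,
# since the first cell appears in all 8 prefix queries).
direct = laplace_answer(W, x, eps)

# Matrix-mechanism style: answer a strategy A instead (here the
# identity, i.e. individual cell counts, sensitivity 1), then derive
# the workload answers as W @ A^+ @ (noisy strategy answers).
A = np.eye(8)
derived = W @ np.linalg.pinv(A) @ laplace_answer(A, x, eps)

print("direct error: ", np.abs(direct - W @ x).mean())
print("derived error:", np.abs(derived - W @ x).mean())
```

The point of the mechanism is that the strategy A can be optimized for the workload; the identity strategy above is only the simplest choice, and the paper's lower bound concerns the best error achievable over all such strategies.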
To avoid some of the challenges of the interactive model, differential privacy has often been adapted to a non-interactive setting, where a common goal has been to release a synthetic data set that the analyst can use in place of the original data. There are a number of appealing benefits to releasing a private synthetic database: the analyst need not carefully divide their task into individual queries, and can use familiar data processing techniques on the synthetic data; the privacy budget will not be exhausted before the queries of interest have been answered; and the analyst can carry out data analysis using their own resources and without revealing their tasks to the data owner.

There are limits, however, to private synthetic data generation. When a synthetic dataset is released, the server no longer controls how many questions the analyst computes from the data. Dinur and Nissim showed that accurately answering “too many” queries of a certain type is incompatible with any reasonable notion of privacy, allowing reconstruction of the database with high probability [6]. This tempers hopes for private synthetic data to some degree, suggesting that if a synthetic dataset is to be private, then it can be accurate only for a specific class of queries, and may need to sacrifice accuracy for other queries.

A number of methods have been proposed for releasing accurate synthetic data for specific sets of queries [5, 15, 13, 22, 23, 3, 1, 19]. These results show that it is still possible to achieve many of the benefits of synthetic data if the released data is targeted to a workload of queries that are of interest to the analyst.
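The Dinur–Nissim reconstruction phenomenon cited above can be illustrated with a toy experiment: given many random subset-sum queries answered with small bounded noise, least-squares recovery reconstructs nearly the whole database. The database size, query count, and noise level below are illustrative assumptions, not parameters from [6].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical secret database: one private bit per individual.
n = 64
x = rng.integers(0, 2, size=n).astype(float)

# "Too many" queries: many random subset-sum queries, each answered
# with small bounded noise (well below sqrt(n), the regime where
# Dinur-Nissim style reconstruction succeeds).
m = 8 * n
Q = rng.integers(0, 2, size=(m, n)).astype(float)
answers = Q @ x + rng.uniform(-2, 2, size=m)

# Least-squares reconstruction, clipped and rounded back to bits.
x_hat = np.round(np.clip(np.linalg.lstsq(Q, answers, rcond=None)[0], 0, 1))

print("fraction of bits recovered:", (x_hat == x).mean())
```

With these parameters the recovered fraction is typically close to 1, which is exactly the failure mode the text describes: unrestricted accurate answers are incompatible with privacy, so a private synthetic dataset must limit which queries it supports well.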