arXiv:1202.3399v2 [cs.DB] 21 Feb 2012

Measuring the achievable error of query sets under differential privacy

Chao Li, Gerome Miklau
University of Massachusetts Amherst, Massachusetts, USA
{chaoli, miklau}@cs.umass.edu

Abstract

A common goal of privacy research is to release synthetic data that satisfies a formal privacy guarantee and can be used by an analyst in place of the original data. To achieve reasonable accuracy, a synthetic data set must be tuned to support a specified set of queries accurately, sacrificing fidelity for other queries. This work considers methods for producing synthetic data under differential privacy and investigates what makes a set of queries “easy” or “hard” to answer. We consider answering sets of linear counting queries using the matrix mechanism [15], a recent differentially private mechanism that can reduce error by adding complex correlated noise adapted to a specified workload. Our main result is a novel lower bound on the minimum total error required to simultaneously release answers to a set of workload queries. The bound reveals that the hardness of a query workload is related to the spectral properties of the workload when it is represented in matrix form. The bound is tight and, because it satisfies important soundness criteria, it can serve as a reliable numerical measure of the “error complexity” of a workload.

1 Introduction

Differential privacy [8] is a rigorous privacy standard offering participants in a data set the appealing guarantee that released query answers will be nearly indistinguishable whether or not their data is included. The earliest methods for achieving differential privacy were interactive: an analyst submits a query to the server and receives a noisy query answer. Further queries may be submitted, but increasing noise will be added and the server may eventually refuse to answer subsequent queries.
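To make the setting of linear counting queries and the matrix mechanism concrete, the following sketch answers a small workload under the Laplace mechanism, either directly or through a strategy matrix. The workload (prefix sums over 8 cells), the identity strategy, and all sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data vector: counts in 8 cells of a histogram.
x = rng.integers(0, 20, size=8).astype(float)

# Workload W: all prefix-sum (range) queries over the 8 cells,
# represented as a matrix so each row is one linear counting query.
W = np.tril(np.ones((8, 8)))

eps = 1.0

def laplace_answer(A, x, eps):
    """Answer the queries in A under eps-differential privacy by adding
    Laplace noise scaled to A's L1 sensitivity (max column L1 norm)."""
    sensitivity = np.abs(A).sum(axis=0).max()
    noise = rng.laplace(scale=sensitivity / eps, size=A.shape[0])
    return A @ x + noise

# Baseline: answer the workload W directly (sensitivity 8 here,
# since the first cell appears in all 8 prefix queries).
direct = laplace_answer(W, x, eps)

# Matrix-mechanism style: answer a strategy A instead (here the
# identity, i.e. individual cell counts, sensitivity 1), then derive
# the workload answers as W @ A^+ @ (noisy strategy answers).
A = np.eye(8)
derived = W @ np.linalg.pinv(A) @ laplace_answer(A, x, eps)

print("direct error: ", np.abs(direct - W @ x).mean())
print("derived error:", np.abs(derived - W @ x).mean())
```

The point of the mechanism is that the strategy A can be optimized for the workload; the identity strategy above is only the simplest choice, and the paper's lower bound concerns the best error achievable over all such strategies.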
To avoid some of the challenges of the interactive model, differential privacy has often been adapted to a non-interactive setting, where a common goal has been to release a synthetic data set that the analyst can use in place of the original data. There are a number of appealing benefits to releasing a private synthetic database: the analyst need not carefully divide their task into individual queries, and can use familiar data processing techniques on the synthetic data; the privacy budget will not be exhausted before the queries of interest have been answered; and the analyst can carry out data analysis using their own resources and without revealing their tasks to the data owner.

There are limits, however, to private synthetic data generation. When a synthetic dataset is released, the server no longer controls how many questions the analyst computes from the data. Dinur and Nissim showed that accurately answering “too many” queries of a certain type is incompatible with any reasonable notion of privacy, allowing reconstruction of the database with high probability [6]. This tempers hopes for private synthetic data to some degree, suggesting that if a synthetic dataset is to be private, then it can be accurate only for a specific class of queries, and may need to sacrifice accuracy for other queries.

A number of methods have been proposed for releasing accurate synthetic data for specific sets of queries [5, 15, 13, 22, 23, 3, 1, 19]. These results show that it is still possible to achieve many of the benefits of synthetic data if the released data is targeted to a workload of queries that are of interest to the analyst.
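The Dinur–Nissim reconstruction phenomenon cited above can be illustrated with a toy experiment: given many random subset-sum queries answered with small bounded noise, least-squares recovery reconstructs nearly the whole database. The database size, query count, and noise level below are illustrative assumptions, not parameters from [6].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical secret database: one private bit per individual.
n = 64
x = rng.integers(0, 2, size=n).astype(float)

# "Too many" queries: many random subset-sum queries, each answered
# with small bounded noise (well below sqrt(n), the regime where
# Dinur-Nissim style reconstruction succeeds).
m = 8 * n
Q = rng.integers(0, 2, size=(m, n)).astype(float)
answers = Q @ x + rng.uniform(-2, 2, size=m)

# Least-squares reconstruction, clipped and rounded back to bits.
x_hat = np.round(np.clip(np.linalg.lstsq(Q, answers, rcond=None)[0], 0, 1))

print("fraction of bits recovered:", (x_hat == x).mean())
```

With these parameters the recovered fraction is typically close to 1, which is exactly the failure mode the text describes: unrestricted accurate answers are incompatible with privacy, so a private synthetic dataset must limit which queries it supports well.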