Information Processing Letters 113 (2013) 409–413

Attacks on statistical databases: The highly noisy case

Alexander Kantor, Kobbi Nissim *,1
Department of Computer Science, Ben-Gurion University, Israel

Article history: Received 15 September 2011; received in revised form 8 July 2012; accepted 8 March 2013; available online 21 March 2013. Communicated by A. Tarlecki.

Keywords: Privacy; Statistical databases; Databases; Learning with noise

Abstract

A formal investigation of the utility–privacy tradeoff in statistical databases has proved essential to the rigorous discussion of privacy in recent years. Initial results in this direction dealt with databases that answer (all) subset-sum queries to within some fixed distortion [Dinur and Nissim, PODC 2003]. Subsequent work extended these results to the case where a constant fraction of the queries is answered arbitrarily [Dwork, McSherry, and Talwar, STOC 2007], and further to the case where up to almost half the queries are answered arbitrarily [Dwork and Yekhanin, CRYPTO 2008]. All these results demonstrate how an efficient attacker may learn the underlying database (exactly or approximately), and hence bear consequences for tasks such as private sanitization of data. We give the first efficient attack for the case where the queries that are answered within the fixed distortion form only a polynomially small fraction of the queries (the rest are answered arbitrarily). Our techniques borrow from program correction and learning in the presence of noise.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

We examine the possibility of providing any privacy when an attacker has access to a statistical database that returns answers satisfying some noise bound.
This question was formalized by Dinur and Nissim [1], who introduced the notion of blatant non-privacy: a situation where an attacker can recover the database almost in its entirety, at low cost, and which should hence be precluded by any reasonable privacy definition. For the fundamental setting of a statistical database x ∈ {0, 1}^n that holds a sensitive bit for each of n individuals, and subset-sum queries (i.e., ⟨x, q⟩ where q ∈ {0, 1}^n) that are (all) answered within additive error up to α, they proved that α = o(√n) implies blatant non-privacy. This result was extended in [2] to the case where the database need answer only a large 1 − ρ fraction of the queries (where ρ does not depend on n) within additive noise α = o(√n), while the remaining ρ fraction of the queries may be answered arbitrarily.

Research partly supported by the Israel Science Foundation (grant No. 860/06).
* Corresponding author. E-mail addresses: kantoras@gmail.com (A. Kantor), kobbi@cs.bgu.ac.il (K. Nissim).
1 Work partly done while the author was at Microsoft Audience Intelligence, Israel.

Our focus is on the case where the fraction of queries on which the database algorithm deviates from α is large. This setting was examined in [2] with respect to adversaries that are query efficient, but computationally unlimited. They showed that, for a database algorithm that answers a β = 1/2 + ε fraction of the queries within error α, a computationally inefficient attacker making O(n/ε) uniformly chosen queries can recover all but O((α/ε)²) entries of the database. For the case β < 1/2, they showed a list-decoding result: O(n/β²) uniformly chosen queries to a database answering a β fraction of the queries within error α determine O(1/β) candidates, at least one of which is within Hamming distance O(α²/β⁴) from the real database. It is not yet known whether these attacks can be made efficient.
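For concreteness, the query model just described can be sketched in a few lines of code. The sketch below is ours, not the paper's: the name `make_noisy_oracle`, the uniform noise, and the particular choice of "arbitrary" answers are illustrative assumptions; the model only requires that a β fraction of queries q ∈ {0, 1}^n be answered within additive error α of the subset sum ⟨x, q⟩.

```python
import random

def make_noisy_oracle(x, alpha, beta=1.0, rng=None):
    """Illustrative statistical-database oracle (a sketch, not the
    paper's construction).  On query q in {0,1}^n, with probability
    beta it returns the subset sum <x, q> perturbed by additive noise
    of magnitude at most alpha; otherwise it answers arbitrarily."""
    rng = rng or random.Random(0)
    n = len(x)

    def answer(q):
        true_sum = sum(xi * qi for xi, qi in zip(x, q))
        if rng.random() < beta:
            # "good" query: answer within additive error alpha of <x, q>
            return true_sum + rng.uniform(-alpha, alpha)
        # "bad" query: arbitrary answer, here uniform in [0, n]
        return rng.uniform(0, n)

    return answer
```

With beta = 1 this is the all-queries-within-α setting of [1]; beta = 1/2 + ε corresponds to the regime of [2,3], and a polynomially small beta corresponds to the regime studied in this paper.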
Dwork and Yekhanin [3] described the first computationally efficient attack for the case β = 1/2 + ε. They

http://dx.doi.org/10.1016/j.ipl.2013.03.005