Information Processing Letters 113 (2013) 409–413
Attacks on statistical databases: The highly noisy case ✩

Alexander Kantor, Kobbi Nissim *,1

Department of Computer Science, Ben-Gurion University, Israel
Article info
Article history:
Received 15 September 2011
Received in revised form 8 July 2012
Accepted 8 March 2013
Available online 21 March 2013
Communicated by A. Tarlecki
Keywords:
Privacy
Statistical databases
Databases
Learning with noise
Abstract

A formal investigation of the utility–privacy tradeoff in statistical databases has proved essential for the rigorous discussion of privacy in recent years. Initial results in this direction dealt with databases that answer (all) subset-sum queries to within some fixed distortion [Dinur and Nissim, PODC 2003]. Subsequent work extended these results to the case where a constant fraction of the queries are answered arbitrarily [Dwork, McSherry, and Talwar, STOC 2007], and further to the case where up to almost half the queries are answered arbitrarily [Dwork and Yekhanin, CRYPTO 2008]. All these results demonstrate how an efficient attacker may learn the underlying database (exactly or approximately), and hence bear consequences for tasks such as private sanitization of data.
We give the first efficient attack for the case where the queries that are answered within the fixed distortion form only a polynomially small fraction of the queries (the rest are answered arbitrarily). Our techniques borrow from program correction and learning in the presence of noise.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
We examine the possibility of providing any privacy
when an attacker has access to a statistical database that
returns answers satisfying some noise bound. This question
was formalized by Dinur and Nissim [1] who introduced
the notion of blatant non-privacy – a situation where an at-
tacker can recover the database almost in its entirety, at
low cost, and hence should be precluded by any reason-
able privacy definition. For the fundamental setting of a statistical database x ∈ {0, 1}^n that holds a sensitive bit on each of n individuals, and subset-sum queries (i.e., ⟨x, q⟩ where q ∈ {0, 1}^n) that are (all) answered within additive error up to α, they proved that α = o(√n) implies blatant non-privacy. This result was extended in [2] to the
✩ Research partly supported by the Israel Science Foundation (grant No. 860/06).
* Corresponding author.
E-mail addresses: kantoras@gmail.com (A. Kantor), kobbi@cs.bgu.ac.il (K. Nissim).
1 Work partly done while the author was at Microsoft Audience Intelligence, Israel.
case where the database need answer only a large 1 − ρ* fraction of the queries (where ρ* does not depend on n) within additive noise α = o(√n), and the remaining ρ* fraction of the queries are answered arbitrarily.
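The query model above can be sketched as a toy oracle. (This simulation is our illustration, not part of [1,2]: the parameter names and, in particular, the choice to answer the ρ fraction of "arbitrary" queries with uniformly random values are illustrative assumptions; in the actual setting those answers need not be random at all.)

```python
import random

def make_noisy_oracle(x, alpha, rho, rng):
    """Toy oracle for subset-sum queries on the bit vector x.

    With probability 1 - rho it answers <x, q> within additive
    error alpha; with probability rho it answers "arbitrarily"
    (modeled here, for illustration only, as a uniformly random
    value in [0, n]).
    """
    n = len(x)

    def oracle(q):
        true_answer = sum(xi * qi for xi, qi in zip(x, q))
        if rng.random() < rho:
            return rng.randint(0, n)                 # arbitrary answer
        return true_answer + rng.uniform(-alpha, alpha)  # alpha-close answer

    return oracle

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(100)]          # the secret database
oracle = make_noisy_oracle(x, alpha=3.0, rho=0.1, rng=rng)
print(oracle([1] * 100))                             # noisy count of ones in x
```

Every answer is either within α of the true subset sum or arbitrary; the attacker never learns which case occurred for a given query.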
Our focus is on the case when the fraction of queries whose answers deviate from the true value by more than α is large. This setting was examined in [2] with respect to adversaries that are query efficient, but computationally unlimited. They showed that with a database algorithm that answers a β = 1/2 + ε fraction of the queries within error α, a computationally inefficient attacker making O(n/ε) uniformly chosen queries can recover all but O((α/ε)²) entries of the database. For the case β < 1/2, they showed a list-decoding result, where O(n/β²) uniformly chosen queries to a database answering a β fraction of the queries within error α determine O(1/β) candidates, at least one of which is within Hamming distance O(α²/β⁴) of the real database. It is not yet known whether these attacks can be made efficient.
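For intuition only, here is a toy sketch (our illustration, not the attack of [1–3]) of why small per-query error leaks the database: under the much stronger assumption that every answer is within additive error α < 1/4, each bit x_i can be read off by differencing two queries that differ only at position i. The results above tolerate far larger error (α = o(√n)) and arbitrarily answered queries, which this naive attack cannot handle.

```python
import random

def difference_attack(oracle, n):
    """Recover x bit by bit, assuming EVERY answer has additive
    error strictly below 1/4 (a toy assumption, far stronger than
    the regimes discussed above).  Then
        oracle(q with q_i = 1) - oracle(q with q_i = 0) = x_i +/- 2*alpha,
    so rounding at 1/2 recovers x_i exactly.
    """
    rng = random.Random(1)
    recovered = []
    for i in range(n):
        q = [rng.randint(0, 1) for _ in range(n)]
        q_with, q_without = q[:], q[:]
        q_with[i], q_without[i] = 1, 0
        diff = oracle(q_with) - oracle(q_without)
        recovered.append(1 if diff > 0.5 else 0)
    return recovered

# Secret database and an oracle answering every query within alpha = 0.2.
x = [random.Random(2).randint(0, 1) for _ in range(50)]
def oracle(q, _rng=random.Random(3)):
    return sum(a * b for a, b in zip(x, q)) + _rng.uniform(-0.2, 0.2)

print(difference_attack(oracle, 50) == x)  # True: full recovery
```

With error bounded by 0.2, the differenced value lies within 0.4 of x_i, so thresholding at 1/2 never errs; as soon as some answers may be arbitrary, a single bad query corrupts a bit, which is why the attacks surveyed above require genuinely different techniques.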
Dwork and Yekhanin [3] described the first computationally efficient attack for the case β = 1/2 + ε. They
http://dx.doi.org/10.1016/j.ipl.2013.03.005