Information–Theoretic Analysis of Privacy Protection for Noisy Identification Based on Soft Fingerprinting

Vladimir B. Balakirsky, Svyatoslav Voloshynovskiy, Oleksiy Koval, Taras Holotyak

Data Security Association “Confident”, Russia
University of Geneva, Switzerland
e-mail: v b balakirsky@rambler.ru, {svolos, Oleksiy.Koval, Taras.Holotyak}@unige.ch

ABSTRACT

Identification of contents or objects based on some data that are stored/distributed in the public domain is required in various applications. At the same time, these data should not reveal any information about the original content or object that may be misused in terms of privacy leakage. We consider a privacy protection strategy based on reliable components of data and investigate the performance of this scheme with respect to achievable identification rate and privacy leak. The data stored/distributed in the public domain are binary, while the encoder and the decoder operate with real data. The advocated strategy is referred to as soft fingerprinting.

Keywords

Information theory, Soft fingerprinting, Identification rate, Privacy leak.

1. INTRODUCTION

Many problems of modern multimedia management (content filtering, content retrieval/search, content tagging and recommendation), multimedia security (copyright protection, broadcast monitoring, etc.) and physical object security, such as biometrics and anti-counterfeiting, require efficient tools for content identification. To find a reasonable trade-off between accuracy, privacy leak, complexity and memory storage, most identification techniques use binary digital fingerprinting. In most cases, a binary fingerprint is a short, robust and distinctive content description that overcomes the fundamental sensitivity of classical cryptographic encryption and hashing to minor noise in the input data [1], [2], [3].

The binary fingerprint is typically constructed by dimensionality reduction followed by binarization [4].
In most cases, the soft information about the magnitudes of the transformed components is neglected, and some privacy amplification procedure is applied to the binary data to prevent recovery of the original data from its binary counterpart. An overview of the state of the art in privacy amplification based on encryption and randomization/compression is presented in [5], while privacy protection using a data hiding approach is proposed in [6]. It is shown that the latter strategy is more efficient in terms of the achievable identification rate-privacy leak trade-off when information about the fingerprint bit reliability is used.

In contrast to privacy amplification based on randomization/compression, which blindly flips a certain fraction of the fingerprint bits, privacy amplification based on data hiding uses soft information about the bit reliability [4] to randomize only the least reliable bits while keeping the most reliable bits unchanged. Additionally, the positions of the most reliable bits in the fingerprint vector are secret and defined by the soft information, which is available only to the authorized encoder/decoder pair and is not stored in the public domain.

The reliable components can be selected either by thresholding the magnitudes of the projected components or by order statistics, i.e., by selecting a fraction of the largest components [6], [7]. The thresholding approach is an element-wise operation, so the decision for each component does not depend on the other components of the vector. However, it leads to sets of reliable components of variable cardinality, which might pose certain challenges for the construction of practical codes. Alternatively, the order statistics approach guarantees sets of fixed cardinality and leads to a simple implementation. Since the order statistics are computed from the entire vector, the resulting components cannot be considered independent, which should be properly analyzed in the context of the achievable rate-privacy leak trade-off.
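The difference between the two selection rules can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold `t`, the vector length `n`, and the use of i.i.d. standard Gaussian components are all assumptions made for illustration. The sketch only shows that thresholding yields a reliable set whose size varies from vector to vector, while selecting the largest magnitudes always yields exactly `w` positions.

```python
import random


def select_by_threshold(x, t):
    """Element-wise rule: mark component j as reliable when |x_j| >= t."""
    return [1 if abs(xj) >= t else 0 for xj in x]


def select_by_order_statistics(x, w):
    """Whole-vector rule: mark the w components with the largest |x_j|."""
    ranked = sorted(range(len(x)), key=lambda j: abs(x[j]), reverse=True)
    top = set(ranked[:w])
    return [1 if j in top else 0 for j in range(len(x))]


def hamming_weight(s):
    """Number of ones in a binary vector."""
    return sum(s)


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed, for repeatability only
    n, w, t = 16, 4, 1.0     # hypothetical parameters, not from the paper
    for trial in range(5):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s_thr = select_by_threshold(x, t)
        s_ord = select_by_order_statistics(x, w)
        # The first weight changes between trials; the second is always w.
        print(trial, hamming_weight(s_thr), hamming_weight(s_ord))
```

In this sketch the threshold rule inspects each component on its own, whereas the order statistics rule has to rank the whole vector, which is why its output positions are statistically coupled.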
We will consider the problem for Gaussian data, but the obtained results can be extended to other probability distributions.

2. PROBLEM FORMULATION

Let us introduce the following notation. Let $w \le n$ be a fixed integer and let $\mathbf{s} = (s_1, \dots, s_n) \in \{0, 1\}^n$ be a binary vector of Hamming weight $w$, i.e.,

$$\bigl| \{ j \in \{1, \dots, n\} : s_j = 1 \} \bigr| = w.$$

Given a real-valued vector $\mathbf{x} = (x_1, \dots, x_n) \in \mathbb{R}^n$, let

$$\mathrm{bin}(\mathbf{x}) = (\mathrm{bin}(x_1), \dots, \mathrm{bin}(x_n)) \in \{0, 1\}^n$$

denote the binary vector constructed according to the rule

$$\mathrm{bin}(x_j) = \begin{cases} 0, & \text{if } x_j < 0, \\ 1, & \text{if } x_j \ge 0 \end{cases}$$

for all $j = 1, \dots, n$. Furthermore, let

$$\mathrm{abs}(\mathbf{x}) = (|x_1|, \dots, |x_n|) \in (\mathbb{R}^+)^n.$$

In other words, the vectors $\mathrm{bin}(\mathbf{x})$ and $\mathrm{abs}(\mathbf{x})$ contain information about the signs and the magnitudes of the components of the vector $\mathbf{x}$, respectively.

The encoder, described below, transforms the vector $\mathbf{x}$ to a binary vector $\mathbf{b} = (b_1, \dots, b_n)$. It keeps $w$ components of