Information–Theoretic Analysis of Privacy Protection for Noisy Identification Based on Soft Fingerprinting

Vladimir B. Balakirsky, Svyatoslav Voloshynovskiy, Oleksiy Koval, Taras Holotyak
Data Security Association “Confident”, Russia
University of Geneva, Switzerland
e-mail: v b balakirsky@rambler.ru, {svolos, Oleksiy.Koval, Taras.Holotyak}@unige.ch

ABSTRACT

Identification of contents or objects based on data that are stored or distributed in the public domain is required in various applications. At the same time, these data should not reveal any information about the original content or object that could be misused and lead to privacy leakage. We consider a privacy protection strategy based on the reliable components of the data and investigate the performance of this scheme with respect to the achievable identification rate and privacy leak. The data stored or distributed in the public domain are binary, while the encoder and the decoder operate on real-valued data. The advocated strategy is referred to as soft fingerprinting.

Keywords

Information theory, Soft fingerprinting, Identification rate, Privacy leak.

1. INTRODUCTION

Many problems of modern multimedia management (content filtering, content retrieval/search, content tagging and recommendation), multimedia security (copyright protection, broadcast monitoring, etc.) and physical object security, such as biometrics and anti-counterfeiting, require efficient tools for content identification. To find a reasonable trade-off between accuracy, privacy leak, complexity and memory storage, most identification techniques use binary digital fingerprinting. In most cases, a binary fingerprint is a short, robust and distinctive content description that makes it possible to overcome the fundamental sensitivity of classical cryptographic encryption and hashing to minor noise in the input data [1], [2], [3]. The binary fingerprint is typically constructed by dimensionality reduction followed by binarization [4].
In most cases, the soft information about the magnitudes of the transformed components is neglected, and some privacy amplification procedure is applied to the binary data to prevent recovery of the original data from its binary counterpart. An overview of the state of the art in privacy amplification based on encryption and on randomization/compression is presented in [5], while privacy protection using a data hiding approach is proposed in [6]. It is shown that the latter strategy is more efficient in terms of the achievable identification rate-privacy leak trade-off when information about the reliability of the fingerprint bits is used.

In contrast to privacy amplification based on randomization/compression, which blindly flips a certain fraction of the fingerprint bits, privacy amplification based on data hiding uses soft information about the bit reliability [4] to randomize only the least reliable bits while keeping the most reliable bits unchanged. Additionally, the positions of the most reliable bits in the fingerprint vector are secret: they are defined by the soft information, which is available only to the authorized encoder/decoder pair and is not stored in the public domain.

The reliable components can be selected either by thresholding the magnitudes of the projected components or by order statistics, i.e., by selecting a fraction of the largest components [6], [7]. Thresholding is an element-wise operation, so the decision for each component does not depend on the other components of the vector. However, it leads to sets of reliable components of variable cardinality, which may pose challenges for the construction of practical codes. Alternatively, the order statistics approach guarantees sets of fixed cardinality and leads to a simple implementation.
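
The two selection rules can be contrasted with a minimal sketch. The function names, the threshold value t, and the convention that a component is reliable when its magnitude is at least t are illustrative assumptions and are not taken from [6], [7]; the sketch uses only the Python standard library.

```python
import random

def select_by_threshold(x, t):
    """Element-wise rule (assumed form): mark x_j as reliable if |x_j| >= t."""
    return [1 if abs(v) >= t else 0 for v in x]

def select_by_order_statistics(x, w):
    """Vector-wise rule: mark the w components of largest magnitude."""
    # Sort the indices by decreasing magnitude and keep the first w of them.
    order = sorted(range(len(x)), key=lambda j: abs(x[j]), reverse=True)
    mask = [0] * len(x)
    for j in order[:w]:
        mask[j] = 1
    return mask

# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
n, w, t = 16, 4, 1.0
for trial in range(3):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Gaussian data
    # The thresholding count varies from vector to vector,
    # while the order statistics count always equals w.
    print(sum(select_by_threshold(x, t)), sum(select_by_order_statistics(x, w)))
```

In this sketch the thresholding mask depends on each component alone, whereas the order statistics mask requires sorting the whole vector, which is the source of the dependence between the selected components.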
Since the order statistics are computed from the entire vector, the resulting components cannot be considered independent, and this dependence must be properly taken into account when analyzing the achievable rate-privacy leak trade-off.

We will consider the problem for Gaussian data, but the obtained results can be extended to other probability distributions.

2. PROBLEM FORMULATION

Let us introduce the following notation. Let $w \le n$ be a fixed integer and let $s = (s_1, \ldots, s_n) \in \{0,1\}^n$ be a binary vector of Hamming weight $w$, i.e.,

$$\bigl|\{\, j \in \{1, \ldots, n\} : s_j = 1 \,\}\bigr| = w.$$

Given a real-valued vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, let

$$\mathrm{bin}(x) = (\mathrm{bin}(x_1), \ldots, \mathrm{bin}(x_n)) \in \{0,1\}^n$$

denote the binary vector constructed according to the rule

$$\mathrm{bin}(x_j) = \begin{cases} 0, & \text{if } x_j < 0, \\ 1, & \text{if } x_j \ge 0 \end{cases}$$

for all $j = 1, \ldots, n$. Furthermore, let

$$\mathrm{abs}(x) = (|x_1|, \ldots, |x_n|) \in (\mathbb{R}^+)^n.$$

In other words, the vectors $\mathrm{bin}(x)$ and $\mathrm{abs}(x)$ contain information about the signs and the magnitudes of the components of the vector $x$, respectively.

The encoder, described below, transforms the vector $x$ to a binary vector $b = (b_1, \ldots, b_n)$. It keeps w components of