Fundamental Limits of Identification: Identification rate, search and memory complexity trade-off

Farzad Farhadzadeh
Dep. of Computer Science, University of Geneva, Geneva, Switzerland
Email: farzad.farhadzadeh@unige.ch

Frans M.J. Willems
Dep. of Electrical Eng., Eindhoven University of Technology, Eindhoven, The Netherlands
Email: f.m.j.willems@tue.nl

Sviatoslav Voloshynovskiy
Dep. of Computer Science, University of Geneva, Geneva, Switzerland
Email: svolos@unige.ch

Abstract—In this paper, we introduce a new generalized scheme to resolve the trade-off between the identification rate and the search and memory complexities in large-scale identification systems. The main contribution of this paper is a special database organization based on assigning the entries of a database to a set of predefined and possibly overlapping clusters, where the cluster representative points are generated based on the statistics of both the entries of the database and the queries. The decoding procedure is accomplished in two stages: at the first stage, a list of clusters related to the query is estimated; at the second stage, refinement checks are performed on all members of these clusters to produce a unique index. The proposed scheme generalizes several practical search schemes used in identification systems and makes it possible to approach a new achievable region of the search-memory complexity trade-off.

I. INTRODUCTION

The identification or nearest neighbor search problem has emerged simultaneously in a number of applications such as human biometrics [1], content management (multimedia retrieval) [2], multimedia security (copy detection, content identification and tracking) [3], as well as physical object security [4].

An identification system [1] consists of two main phases: enrollment and identification. In the first phase, the enrollment, feature vectors representing digital contents, humans or physical objects are extracted and stored in a database.
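The two-stage decoding procedure outlined in the abstract can be illustrated by a toy sketch. All names (`enroll`, `identify`) and the choice of binary features with Hamming distance are illustrative assumptions, not the paper's construction; they only mirror the idea of cluster representative points, a list of candidate clusters at the first stage, and refinement checks at the second stage.

```python
def hamming(x, y):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(x, y))

def enroll(database, representatives):
    """Enrollment: assign every entry to its nearest cluster representative.
    Ties are kept, so clusters may overlap, as in the proposed scheme."""
    clusters = {i: [] for i in range(len(representatives))}
    for idx, item in enumerate(database):
        dists = [hamming(item, r) for r in representatives]
        dmin = min(dists)
        for i, d in enumerate(dists):   # keep all ties -> overlapping clusters
            if d == dmin:
                clusters[i].append(idx)
    return clusters

def identify(query, database, representatives, clusters, list_size=2):
    """Two-stage decoding: (1) estimate a list of candidate clusters from the
    query, (2) run refinement checks on every member of those clusters."""
    ranked = sorted(range(len(representatives)),
                    key=lambda i: hamming(query, representatives[i]))
    candidates = set()
    for i in ranked[:list_size]:        # stage 1: list of clusters
        candidates.update(clusters[i])
    # stage 2: exhaustive refinement over the (small) candidate set only
    return min(candidates, key=lambda idx: hamming(query, database[idx]))
```

The search cost is driven by the number of representatives plus the size of the candidate clusters rather than by the full database size, which is the source of the speed-up that the paper analyzes.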
In the identification phase, a noisy (degraded) counterpart of an enrolled item, called the query, is presented to the identification system, which identifies it by comparing it to the feature vectors stored in the database.

In modern applications, the size of a database may be of the order of several billion entries. Therefore, the theoretical investigation and development of practical methods achieving the identification capacity [1] is of great interest. An efficient approach should satisfy several important requirements. First, users should be able to identify the objects or individuals reliably (reliability). Secondly, the decoding method should be as fast as possible (search complexity). Finally, it should require the least possible amount of memory for both the items and the indexing structure (memory complexity). These three requirements define an information-theoretical problem that considers maximization of the identification rate and minimization of the computational and memory complexities. It should be pointed out that these requirements contradict each other, and in fact this triple trade-off is still an open and emerging research problem.

In principle, an identification system can perform an exhaustive search over all entries of the database to find the best match. [5] gives an extensive overview of methods to reduce search complexity in metric spaces. [2] compares indexing techniques to methods based on what they call vector approximations (VA) and observes that for searching in high-dimensional spaces quantization methods like VA outperform indexing methods. Similar to these VA methods are the fingerprinting techniques used in content-based audio identification [6]. In an information-theoretical context such methods would be referred to as quantization methods. Quantization can also be used in the enrollment phase with the objective to compress the database.
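The contrast between the exhaustive-search baseline and a VA-style quantized search can be sketched as follows. This is a minimal illustration under assumed names (`exhaustive_identify`, `va_identify`) and an assumed scalar quantizer; it is not the construction of [2] or [6], only the general idea of scanning small compressed approximations first and verifying exact distances on the surviving candidates.

```python
import math

def euclid(x, y):
    """Euclidean distance between two real-valued feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def exhaustive_identify(query, database):
    """Baseline: compare the query against every enrolled entry."""
    return min(range(len(database)), key=lambda i: euclid(query, database[i]))

def quantize(x, step=1.0):
    """Coarse scalar quantization: the compact approximation kept in memory."""
    return tuple(round(v / step) for v in x)

def va_identify(query, database, step=1.0):
    """VA-style search: scan the approximations first, then verify exact
    distances only for entries near the query's quantization cell."""
    approx = [quantize(x, step) for x in database]  # precomputed in practice
    q = quantize(query, step)
    cand = [i for i, a in enumerate(approx)
            if all(abs(ai - qi) <= 1 for ai, qi in zip(a, q))]
    if not cand:                                    # fall back to a full scan
        cand = range(len(database))
    return min(cand, key=lambda i: euclid(query, database[i]))
```

Storing only the quantized approximations is also the simplest form of database compression during enrollment, which motivates the trade-off between compression rate and identification performance discussed next.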
[7] exploits quantization during enrollment and considers the fundamental trade-off between compression rate and reconstruction distortion. Later, [8] considered the trade-off between enrollment compression rate and identification rate. [4] exploits a search scheme based on a Hamming sphere around the noisy feature vector, which can reduce the search complexity while simultaneously achieving the identification capacity. However, it should be noted that this scheme is efficient only when the degradation between the queries and the enrolled data is low.

This paper generalizes the scheme introduced by Willems in [9], which speeds up the search process by means of clustering: upon observing a query, the system first detects to which cluster the related item belongs, and after that decides about the item itself (two-stage identification). The main differences between the current manuscript and [9] are: (a) generalization of the cluster representative points, considered as auxiliary random variables, based on the statistics of both the entries of the database and the queries, while in [9] the cluster representative points were generated only based on the statistics of the queries; (b) estimation of a list of clusters at the first stage of decoding, upon observing a query, versus the unique estimation in [9]; (c) direct treatment of the memory complexity in the analysis of the triple trade-off, which was not addressed in [9]; (d) a new result on the search-memory complexity region of capacity-achieving identification systems.

In the next section we present our model of an identification