Fundamental Limits of Identification: Identification rate, search and memory complexity trade-off

Farzad Farhadzadeh
Dep. of Computer Science, University of Geneva, Geneva, Switzerland
Email: farzad.farhadzadeh@unige.ch

Frans M.J. Willems
Dep. of Electrical Eng., Eindhoven University of Technology, Eindhoven, The Netherlands
Email: f.m.j.willems@tue.nl

Sviatoslav Voloshynovskiy
Dep. of Computer Science, University of Geneva, Geneva, Switzerland
Email: svolos@unige.ch

Abstract—In this paper, we introduce a new generalized scheme to resolve the trade-off between the identification rate and the search and memory complexities in large-scale identification systems. The main contribution of this paper is a special database organization based on assigning the entries of a database to a set of predefined and possibly overlapping clusters, where the cluster representative points are generated based on the statistics of both the entries of the database and the queries. The decoding procedure is accomplished in two stages: at the first stage, a list of clusters related to the query is estimated; at the second stage, refinement checks are performed on all members of these clusters to produce a unique index. The proposed scheme generalizes several practical search schemes used in identification systems and makes it possible to approach a new achievable region of the search-memory complexity trade-off.

I. INTRODUCTION

The identification or nearest neighbor search problem has emerged simultaneously in a number of applications such as human biometrics [1], content management (multimedia retrieval) [2], multimedia security (copy detection, content identification and tracking) [3], as well as physical object security [4].

An identification system [1] consists of two main phases: enrollment and identification. In the first phase, the enrollment, feature vectors representing digital contents, humans or physical objects are extracted and stored in a database.
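The two-stage decoding procedure outlined in the abstract can be illustrated by a toy sketch. All names (`enroll`, `identify`) and the choice of binary features with Hamming distance are illustrative assumptions, not the paper's construction; they only mirror the idea of cluster representative points, a list of candidate clusters at the first stage, and refinement checks at the second stage.

```python
def hamming(x, y):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(x, y))

def enroll(database, representatives):
    """Enrollment: assign every entry to its nearest cluster representative.
    Ties are kept, so clusters may overlap, as in the proposed scheme."""
    clusters = {i: [] for i in range(len(representatives))}
    for idx, item in enumerate(database):
        dists = [hamming(item, r) for r in representatives]
        dmin = min(dists)
        for i, d in enumerate(dists):   # keep all ties -> overlapping clusters
            if d == dmin:
                clusters[i].append(idx)
    return clusters

def identify(query, database, representatives, clusters, list_size=2):
    """Two-stage decoding: (1) estimate a list of candidate clusters from the
    query, (2) run refinement checks on every member of those clusters."""
    ranked = sorted(range(len(representatives)),
                    key=lambda i: hamming(query, representatives[i]))
    candidates = set()
    for i in ranked[:list_size]:        # stage 1: list of clusters
        candidates.update(clusters[i])
    # stage 2: exhaustive refinement over the (small) candidate set only
    return min(candidates, key=lambda idx: hamming(query, database[idx]))
```

The search cost is driven by the number of representatives plus the size of the candidate clusters rather than by the full database size, which is the source of the speed-up that the paper analyzes.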
In the identification phase, a noisy (degraded) counterpart of an enrolled item, called the query, is presented to the identification system, which identifies it by comparing it to the feature vectors stored in the database.

In modern applications, the size of a database may be of the order of several billion entries. Therefore, the theoretical investigation and development of practical methods achieving the identification capacity [1] is of great interest. An efficient approach should satisfy several important requirements. First, users should be able to identify the objects or individuals reliably (reliability). Secondly, the decoding method should be as fast as possible (search complexity). Finally, it should require the least possible amount of memory for both the items and the indexing structure (memory complexity). These three requirements define an information-theoretical problem that considers maximization of the identification rate and minimization of the computational and memory complexities. It should be pointed out that these requirements contradict each other, and in fact this triple trade-off is still an open and emerging research problem.

In principle, an identification system can perform an exhaustive search over all entries of the database to find the best match. [5] gives an extensive overview of methods to reduce search complexity in metric spaces. [2] compares indexing techniques to methods based on what they call vector approximations (VA) and observes that for searching in high-dimensional spaces quantization methods like VA outperform indexing methods. Similar to these VA methods are the fingerprinting techniques used in content-based audio identification [6]. In an information-theoretical context such methods would be referred to as quantization methods. Quantization can also be used in the enrollment phase with the objective to compress the database.
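The contrast between the exhaustive-search baseline and a VA-style quantized search can be sketched as follows. This is a minimal illustration under assumed names (`exhaustive_identify`, `va_identify`) and an assumed scalar quantizer; it is not the construction of [2] or [6], only the general idea of scanning small compressed approximations first and verifying exact distances on the surviving candidates.

```python
import math

def euclid(x, y):
    """Euclidean distance between two real-valued feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def exhaustive_identify(query, database):
    """Baseline: compare the query against every enrolled entry."""
    return min(range(len(database)), key=lambda i: euclid(query, database[i]))

def quantize(x, step=1.0):
    """Coarse scalar quantization: the compact approximation kept in memory."""
    return tuple(round(v / step) for v in x)

def va_identify(query, database, step=1.0):
    """VA-style search: scan the approximations first, then verify exact
    distances only for entries near the query's quantization cell."""
    approx = [quantize(x, step) for x in database]  # precomputed in practice
    q = quantize(query, step)
    cand = [i for i, a in enumerate(approx)
            if all(abs(ai - qi) <= 1 for ai, qi in zip(a, q))]
    if not cand:                                    # fall back to a full scan
        cand = range(len(database))
    return min(cand, key=lambda i: euclid(query, database[i]))
```

Storing only the quantized approximations is also the simplest form of database compression during enrollment, which motivates the trade-off between compression rate and identification performance discussed next.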
[7] exploits quantization during enrollment and considers the fundamental trade-off between compression rate and reconstruction distortion. Later, [8] considered the trade-off between enrollment compression rate and identification rate. [4] exploits a search scheme based on a Hamming sphere around the noisy feature vector, which can reduce the search complexity while simultaneously achieving the identification capacity. However, it should be noted that this scheme is efficient only when the degradation between the queries and the enrolled data is low.

This paper generalizes the scheme introduced by Willems in [9], which speeds up the search process by means of clustering: upon observing a query, the system first detects to which cluster the related item belongs, and after that decides about the item itself (two-stage identification). The main differences between the current manuscript and [9] are: (a) generalization of the cluster representative points, considered as auxiliary random variables, based on the statistics of both the entries of the database and the queries, while in [9] the cluster representative points were generated only based on the statistics of the queries; (b) estimation of a list of clusters at the first stage of decoding, upon observing a query, versus the unique estimation in [9]; (c) direct treatment of the memory complexity in the analysis of the triple trade-off, which was not addressed in [9]; (d) a new result on the search-memory complexity region of capacity-achieving identification systems.

In the next section we present our model of an identification