SEARCHING FOR NEAREST NEIGHBORS WITH A DENSE SPACE PARTITIONING

Tuan Anh Nguyen, Yusuke Matsui, Toshihiko Yamasaki, Kiyoharu Aizawa
The University of Tokyo

ABSTRACT

Product quantization based approximate nearest neighbor search with inverted index structures has recently received increasing attention. In this paper, we propose a new inverted index structure for searching nearest neighbors in very large datasets of high-dimensional data. For data indexing, our proposed method creates a dense space partitioning using multiple-centroid assignment, which generates shorter candidate lists and improves the search speed. Our experiments with a dataset of one billion SIFT features show that, while achieving higher accuracy, our method demonstrates better search speed than IVFADC, the conventional product quantization based inverted index structure.

Index Terms— computer vision, image retrieval, nearest neighbor search, product quantization, inverted index

1. INTRODUCTION

For most computer vision and image retrieval applications [1, 2, 3], approximate nearest neighbor search [4, 5, 6] is an important problem. The most popular methods for solving this problem use Euclidean Locality-Sensitive Hashing [5] or tree-based solutions [4, 7, 8, 9, 10]. Most of these methods, however, are memory consuming because of the use of hash tables and trees. Product quantization based approximate nearest neighbor search [6, 11] is preferable in terms of the trade-off between speed, memory usage, and accuracy. In this context, Jégou et al. [6] proposed an inverted file system with asymmetric distance computation (IVFADC) that is used in almost all studies on product quantization based methods for nearest neighbor search.

IVFADC consists of a data indexing algorithm and a search algorithm. For data indexing, the dataset is partitioned into many cells using k-means [12].
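The coarse partitioning step can be sketched as follows; this is a minimal illustration on toy data, not the paper's implementation, and the function names are our own:

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain k-means: returns K centroids and the cell assignment of each vector."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]  # initialize from random samples
    for _ in range(iters):
        # assign each vector to its nearest centroid (squared L2 distance)
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # move each centroid to the mean of its cell
        for k in range(K):
            if np.any(assign == k):
                C[k] = X[assign == k].mean(0)
    return C, assign

X = np.random.default_rng(1).normal(size=(1000, 16)).astype(np.float32)
C, assign = kmeans(X, K=8)
print(C.shape, assign.shape)  # (8, 16) (1000,)
```

In IVFADC each of the K cells holds an inverted list of the database vectors assigned to it, so a query only scans the list of its own cell.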
The residual between each database vector and its assigned centroid is encoded and stored using product quantization. For searching, a query vector is assigned to its nearest cell, and only the database vectors assigned to that cell are evaluated instead of traversing all the database vectors. This approach reduces the number of candidates and makes the search faster. Memory usage is small, since the number of centroids is small.

[Fig. 1] The illustration of the proposed method with 2-dimensional data and a quantizer that has 3 centroids: (a) IVFADC (k-means) [6]: a query q is assigned to a cell c1, and the database vectors in cell c1 (the gray area) are evaluated; (b) Proposed method: the query is assigned to cell c12 under a strict assignment condition: the nearest neighbor of q among the centroids is c1 and the second nearest is c2. The number of cells increases in this data structure, and the space partitioning becomes dense.

In this paper, we improve IVFADC and propose a new inverted index structure that can generate a large number of cells (Fig. 1). Our idea is simple: we use more than one nearest centroid to create the cells, producing a dense space partitioning. This approach drastically increases the number of cells from K to K(K − 1), where K is the number of cells in IVFADC. Since the number of database vectors in each cell decreases when the number of cells increases, the search becomes faster. The recall is also improved, since many wrong candidates are removed and replaced by other, more promising candidates. Recently, several approaches have optimized the quantization distortion [13, 14, 15] or the distance estimation [11] to achieve a better recall for product quantization based methods. In this paper, differing from those approaches, we focus on the inverted index structure.
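The multiple-centroid assignment can be sketched as follows. Each vector is keyed by the ordered pair (nearest, second-nearest) centroid, giving K(K − 1) possible cells; the pair-to-index mapping `cell_id` is our own illustrative choice, not taken from the paper:

```python
import numpy as np

def two_nearest(C, x):
    """Indices of the nearest and second-nearest centroids of x (squared L2)."""
    d = ((C - x) ** 2).sum(1)
    order = np.argsort(d)
    return order[0], order[1]

def cell_id(i, j, K):
    """Map an ordered centroid pair (i, j), i != j, to one of K*(K-1) cell indices."""
    return i * (K - 1) + (j if j < i else j - 1)

rng = np.random.default_rng(0)
K, dim = 8, 16
C = rng.normal(size=(K, dim))          # coarse centroids (e.g. from k-means)
X = rng.normal(size=(1000, dim))       # toy database

# indexing: every database vector goes into the cell of its two nearest centroids
lists = {}
for n, x in enumerate(X):
    i, j = two_nearest(C, x)
    lists.setdefault(cell_id(i, j, K), []).append(n)

# searching: the query visits only its own, much shorter, candidate list
q = rng.normal(size=dim)
i, j = two_nearest(C, q)
candidates = lists.get(cell_id(i, j, K), [])
print(len(lists), len(candidates))
```

With K = 8 there are up to 56 cells instead of 8, so each inverted list is correspondingly shorter on average, which is the source of the speed-up described above.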
2. PRODUCT QUANTIZATION BASED INDEXING

In this section, we briefly review the product quantization based indexing method of [6]. Let q ∈ R^d be the query and Y = {y_i}_{i=1}^N a set of vectors in which we want to find the