Corrigendum to “Efficient Similarity Search and Classification via Rank Aggregation” by Ronald Fagin, Ravi Kumar and D. Sivakumar (Proc. SIGMOD’03)

Alexandr Andoni (MIT), Ronald Fagin (IBM Almaden), Ravi Kumar (Yahoo! Research), Mihai Pătraşcu (MIT), D. Sivakumar (Google Inc.)

Categories and Subject Descriptors: E.1 [Data structures].
General Terms: Algorithms, theory.
Keywords: Nearest neighbor, rank aggregation, score aggregation, median.

In this corrigendum, we correct an error in the paper [1]. The error was discovered by Alexandr Andoni, and the corrected theorem is due to the three authors of [1], along with Alexandr Andoni and Mihai Pătraşcu.

Theorem 4 of [1] states:

Let $D$ be a collection of $n$ points in $\mathbb{R}^d$. Let $r_1, \ldots, r_m$ be random unit vectors in $\mathbb{R}^d$, where $m = \alpha \epsilon^{-2} \log n$ with $\alpha$ suitably chosen. Let $q \in \mathbb{R}^d$ be an arbitrary point, and define, for each $i$ with $1 \le i \le m$, the ranked list $L_i$ of the $n$ points in $D$ by sorting them in increasing order of their distances to the projection of $q$ along $r_i$. For each element $x$ of $D$, let $\mathrm{medrank}(x) = \mathrm{median}(L_1(x), \ldots, L_m(x))$. Let $z$ be a member of $D$ such that $\mathrm{medrank}(z)$ is minimized. Then with probability at least $1 - 1/n$, we have $\|z - q\|_2 \le (1+\epsilon)\|x - q\|_2$ for all $x \in D$.

As stated, the above theorem does not hold, but a version of it holds if one replaces the median over ranks by a median over suitably defined scores. Below, we give a counterexample to the original theorem, and then present our modification to the theorem, and the resulting algorithm.

1. A COUNTEREXAMPLE

Intuitively, the above theorem does not hold in the following situation. Suppose $q$ is the query point, $p$ is the nearest neighbor of $q$, and $z$ is at distance $(1+\epsilon)\|p - q\|_2$. For a random unit vector $r$, let $\mathrm{rank}_r(p)$ denote the rank of the point $p$ in the list $L_r$ of the set $D$ of points sorted by their distance to the projection of $q$ along $r$.
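As a concrete reading of the definitions of $\mathrm{rank}_r$ and medrank quoted above, the following is a minimal Python sketch and not the implementation from [1]. It makes several assumptions of its own: the "distance to the projection of $q$ along $r$" is taken to be $|\langle x - q, r \rangle|$, ties share the smallest rank, the median of an even number of ranks is the lower median, and the function names are hypothetical.

```python
import math
import random
from statistics import median_low


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d (normalized Gaussians)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def projected_ranks(points, q, r):
    """rank_r(x) for every x in points.

    Assumption: the projected distance is |<x - q, r>|, and the rank of x is
    1 + (number of points strictly closer along r), so tied points share a rank.
    """
    dist = [abs(sum((xi - qi) * ri for xi, qi, ri in zip(x, q, r))) for x in points]
    return [1 + sum(1 for other in dist if other < d) for d in dist]


def medrank_winner(points, q, m, rng):
    """Return (index of a point with minimum median rank, list of median ranks)."""
    ranks_per_direction = [
        projected_ranks(points, q, random_unit_vector(len(q), rng)) for _ in range(m)
    ]
    # Column j collects the m ranks of point j; median_low keeps the result an integer rank.
    med = [median_low(column) for column in zip(*ranks_per_direction)]
    best = min(range(len(points)), key=med.__getitem__)
    return best, med
```

The quadratic-time rank computation is chosen for readability only; sorting each projected list once would be the natural choice for larger inputs.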
While it is true that $\mathrm{rank}_r(p) < \mathrm{rank}_r(z)$ holds a $1/2 + \Omega(\epsilon)$ fraction of the time (over the random choice of $r$), we cannot infer the same for the overall median rank once the other points in $D$ are taken into consideration. In particular, a bad dataset is one where, whenever $\mathrm{rank}_r(p) < \mathrm{rank}_r(z)$, about half of the time both ranks are high, but when $\mathrm{rank}_r(p) > \mathrm{rank}_r(z)$, the point $z$ has very small rank and $p$ has a high rank. Then, in the end, $p$ will have a high rank about 75% of the time, while $z$ has a high rank about 25% of the time. Our counterexample constructs a set with (roughly) such characteristics.

Copyright is held by the author/owner(s). SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada. ACM 978-1-60558-102-6/08/06.

We give a specific set of $n \ge 10$ points in 2-dimensional space. Consider the following point set for very small $\epsilon$, illustrated in Fig. 1:

• point $q = (0, 0)$, the query;
• point $p = (0, 1)$, the nearest neighbor;
• point $z = (1+\epsilon, 0)$, the false nearest neighbor;
• a set $H$ of $\frac{n-3}{2}$ points all at distance $(1+\epsilon)^2$ from $q$, specifically at $h = (1+\epsilon)^2 \cdot \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$;
• a set $S$ of the same size as $H$, namely $\frac{n-3}{2}$ points, all situated at $s = (1+\epsilon)^2 \cdot (1, 0)$.

Figure 1: The pointset for our counterexample, where $q$ is the query and $p$ is the nearest neighbor. The grey point is the midpoint of the segment $ps$.

Let $r$ be a random unit vector in $\mathbb{R}^2$, and let $L_r$ and $\mathrm{rank}_r(x)$ be as defined earlier. Then we have the following two claims. Below, $\Pr_r$ denotes probability over the random choice of $r$.

Claim 1.1. $\Pr_r[\mathrm{rank}_r(z) \le 2] \ge 1/2 + \Omega(\epsilon)$.

Claim 1.1 follows immediately from Lemma 3 of [1].

Claim 1.2. $\Pr_r[\mathrm{rank}_r(p) > |H|] \ge 1/2 + \Omega(1)$.

We prove Claim 1.2 next. It is sufficient to consider $r$'s with non-negative $x$ coordinate (since $r$ and $-r$ yield the same list $L_r$), and identify $r$'s by their angle $\gamma_r$ with the $x$ axis.
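As an informal numerical sanity check of Claims 1.1 and 1.2 (not part of the proof), the two probabilities can be estimated by sampling random directions on the point set above. The sketch below is a minimal Python illustration under stated assumptions: the query $q$ itself is not ranked, the projected distance of $x$ is $|\langle x, r \rangle|$ since $q$ is the origin, co-located points share the smallest rank, and the function name and default parameters are hypothetical.

```python
import math
import random


def estimate_claims(n=11, eps=0.1, trials=20000, seed=0):
    """Monte Carlo estimates of Pr[rank_r(z) <= 2] and Pr[rank_r(p) > |H|]."""
    if n < 10 or (n - 3) % 2 != 0:
        raise ValueError("need n >= 10 with n - 3 even, so that |H| = |S| = (n - 3) / 2")
    size = (n - 3) // 2                      # |H| = |S|
    scale = (1.0 + eps) ** 2
    p = (0.0, 1.0)                           # true nearest neighbor
    z = (1.0 + eps, 0.0)                     # false nearest neighbor
    h = (scale / math.sqrt(2.0), scale / math.sqrt(2.0))
    s = (scale, 0.0)
    # Each entry is (location, multiplicity); q = (0, 0) is the query and is not ranked.
    clusters = [(p, 1), (z, 1), (h, size), (s, size)]

    def rank(i, dist):
        # Assumption: rank = 1 + number of points strictly closer along r.
        return 1 + sum(mult for (_, mult), d in zip(clusters, dist) if d < dist[i])

    rng = random.Random(seed)
    z_low = 0
    p_high = 0
    for _ in range(trials):
        gamma = rng.uniform(0.0, 2.0 * math.pi)   # uniform angle = uniform unit vector in R^2
        r = (math.cos(gamma), math.sin(gamma))
        dist = [abs(x[0] * r[0] + x[1] * r[1]) for x, _ in clusters]
        if rank(1, dist) <= 2:
            z_low += 1
        if rank(0, dist) > size:
            p_high += 1
    return z_low / trials, p_high / trials


if __name__ == "__main__":
    frac_z, frac_p = estimate_claims()
    print(f"Pr[rank_r(z) <= 2] ~ {frac_z:.3f}, Pr[rank_r(p) > |H|] ~ {frac_p:.3f}")
```

If the claims hold, both estimates should come out above 1/2, the first by a margin that shrinks with $\epsilon$ and the second by a constant margin.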
First, we note that $\mathrm{rank}_r(p) \le \mathrm{rank}_r(s)$ iff $\gamma_r \in [\alpha, \beta]$, where $\alpha$ is the angle formed by the perpendicular to the line