I/O-Efficient Batched Union-Find and Its Applications to Terrain Analysis*

Pankaj K. Agarwal (1), Lars Arge (2,1), Ke Yi (1)

(1) Department of Computer Science, Duke University, Durham, NC 27708, USA. {pankaj,large,yike}@cs.duke.edu
(2) Department of Computer Science, University of Aarhus, Aarhus, Denmark. large@daimi.au.dk

ABSTRACT

Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of $N$ union and find operations, where each union operation joins two distinct sets, our algorithm uses $O(\mathrm{SORT}(N)) = O\left(\frac{N}{B} \log_{M/B} \frac{N}{B}\right)$ I/Os, where $M$ is the memory size and $B$ is the disk block size. This bound is asymptotically optimal in the worst case. If there are union operations that join a set with itself, our algorithm uses $O(\mathrm{SORT}(N) + \mathrm{MST}(N))$ I/Os, where $\mathrm{MST}(N)$ is the number of I/Os needed to compute the minimum spanning tree of a graph with $N$ edges. We also describe a simple and practical $O(\mathrm{SORT}(N) \log(N/M))$-I/O algorithm for this problem, which we have implemented.

We are interested in the union-find problem because of its applications in terrain analysis. A terrain can be abstracted as a height function defined over $\mathbb{R}^2$, and many problems that deal with such functions require a union-find data structure. With the emergence of modern mapping technologies, huge amounts of elevation data are being generated that are too large to fit in memory, so I/O-efficient algorithms are needed to process this data efficiently. In this paper, we study two terrain analysis problems that benefit from a union-find data structure: (i) computing topological persistence and (ii) constructing the contour tree. We give the first $O(\mathrm{SORT}(N))$-I/O algorithms for these two problems, assuming that the input terrain is represented as a triangular mesh with $N$ vertices.
Finally, we report some preliminary experimental results, showing that our algorithms give an order-of-magnitude improvement over previous methods on large data sets that do not fit in memory.

* Work on this paper is supported by ARO grant W911NF-04-1-0278. Pankaj K. Agarwal and Ke Yi are also supported by NSF under grants CCR-00-86013, EIA-01-31905, CCR-02-04118, and DEB-04-25465, by ARO grant DAAD19-03-1-0352, and by a grant from the U.S.–Israel Binational Science Foundation. Lars Arge is also supported by an Ole Rømer Scholarship from the Danish National Science Research Council.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG'06, June 5–7, 2006, Sedona, Arizona, USA.
Copyright 2006 ACM 1-59593-340-9/06/0006 ...$5.00.

Categories and Subject Descriptors: F.2.2 [Theory of Computation]: Nonnumerical Algorithms and Problems.
General Terms: Algorithms, Experimentation.
Keywords: Union-find, terrain analysis, contour trees, I/O-efficient algorithms.

1. INTRODUCTION

The union-find problem asks us to maintain a partition of a set U = {x_1, x_2, ...} (the universe), together with a representative element of each set in the partition, under a sequence Σ of UNION(x_i, x_j) and FIND(x_i) operations: UNION(x_i, x_j) joins the set containing x_i and the set containing x_j, and provides a new representative element for the new set; FIND(x_i) returns the representative element of the set containing x_i. In the on-line version of the problem, Σ is given one operation at a time, whereas in the batched (or off-line) version, the entire sequence Σ is known in advance.
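To make the UNION and FIND semantics above concrete, the following is a minimal in-memory sketch of processing a batched sequence Σ. It uses the classic union-by-size and path-compression structure, which is not the I/O-efficient algorithm of this paper; the function name `process_batch` and the tuple encoding of operations are assumptions made purely for illustration.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def process_batch(ops: Sequence[Tuple]) -> List[Hashable]:
    """Run a batched sequence of ("union", x, y) and ("find", x) operations.

    Returns the representatives reported by the FIND operations, in order.
    Illustrative in-memory reference only (hypothetical encoding of the ops).
    """
    parent: Dict[Hashable, Hashable] = {}
    size: Dict[Hashable, int] = {}

    def find(x: Hashable) -> Hashable:
        # Elements of the universe are created lazily as singleton sets.
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        # First pass: locate the root, which serves as the representative.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Second pass: path compression, pointing visited nodes at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    answers: List[Hashable] = []
    for op in ops:
        if op[0] == "union":
            a, b = find(op[1]), find(op[2])
            if a != b:
                # Union by size: attach the smaller tree under the larger one.
                if size[a] < size[b]:
                    a, b = b, a
                parent[b] = a
                size[a] += size[b]
        elif op[0] == "find":
            answers.append(find(op[1]))
        else:
            raise ValueError(f"unknown operation: {op[0]!r}")
    return answers


if __name__ == "__main__":
    sigma = [("union", 1, 2), ("find", 1), ("find", 2),
             ("union", 3, 4), ("union", 2, 3), ("find", 4), ("find", 5)]
    print(process_batch(sigma))
```

Because this sketch touches `parent` entries in an essentially random order, it would incur roughly one I/O per access once the data no longer fits in memory, which is the behavior the batched setting tries to avoid.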
The union-find problem is a fundamental algorithmic problem because of its applications in numerous problems across different domains, from programming languages to graph and geometric algorithms, and from computational topology to computational biology. In many of these algorithms only the batched version of the problem is required; see [11, 13, 16, 20] for a sample of applications.

The main motivation for our study of the union-find problem arises from terrain modeling and analysis. A terrain can be abstracted as a height function defined over $\mathbb{R}^2$, and there is a rich literature on the study of such functions. We are interested in two broad problems in terrain analysis, namely flow analysis and contour-line analysis. A key step in the flow analysis of a terrain is to modify the height function so that "small" depressions on the terrain (sinks) disappear. We use the notion of topological persistence, introduced in [16], to address this problem. In contour-line analysis, the notion of a contour tree is critical [11, 29, 34]. Most existing topological persistence and contour tree algorithms rely on efficient data structures for the batched union-find problem.

With the emergence of high-resolution terrain-mapping technologies, huge amounts of data are being generated that are too large to fit in memory and have to reside on disk. Existing algorithms cannot handle such massive data sets, mainly because they optimize CPU running time, whereas optimizing disk access is much more important. Motivated by these factors, we propose efficient algorithms for the batched union-find problem in the I/O model [3] (also known as the external memory model), and use them to develop I/O-efficient algorithms for computing topological persistence and contour trees.

Related results. In the I/O model, the machine consists of an infinite-size external memory (disk) and a main memory of size $M$. A