I/O-Efficient Batched Union-Find and Its Applications to Terrain Analysis ∗

Pankaj K. Agarwal¹, Lars Arge²,¹, Ke Yi¹

¹ Department of Computer Science, Duke University, Durham, NC 27708, USA. {pankaj,large,yike}@cs.duke.edu
² Department of Computer Science, University of Aarhus, Aarhus, Denmark. large@daimi.au.dk

ABSTRACT

Despite extensive study over the last four decades and numerous applications, no I/O-efficient algorithm is known for the union-find problem. In this paper we present an I/O-efficient algorithm for the batched (off-line) version of the union-find problem. Given any sequence of N union and find operations, where each union operation joins two distinct sets, our algorithm uses $O(\mathrm{SORT}(N)) = O\left(\frac{N}{B} \log_{M/B} \frac{N}{B}\right)$ I/Os, where M is the memory size and B is the disk block size. This bound is asymptotically optimal in the worst case. If there are union operations that join a set with itself, our algorithm uses $O(\mathrm{SORT}(N) + \mathrm{MST}(N))$ I/Os, where MST(N) is the number of I/Os needed to compute the minimum spanning tree of a graph with N edges. We also describe a simple and practical $O(\mathrm{SORT}(N) \log(N/M))$-I/O algorithm for this problem, which we have implemented.

We are interested in the union-find problem because of its applications in terrain analysis. A terrain can be abstracted as a height function defined over $\mathbb{R}^2$, and many problems that deal with such functions require a union-find data structure. With the emergence of modern mapping technologies, huge amounts of elevation data are being generated that are too large to fit in memory, so I/O-efficient algorithms are needed to process this data. In this paper, we study two terrain analysis problems that benefit from a union-find data structure: (i) computing topological persistence and (ii) constructing the contour tree. We give the first O(SORT(N))-I/O algorithms for these two problems, assuming that the input terrain is represented as a triangular mesh with N vertices.
Finally, we report some preliminary experimental results showing that our algorithms give an order-of-magnitude improvement over previous methods on large data sets that do not fit in memory.

∗ Work on this paper is supported by ARO grant W911NF-04-1-0278. Pankaj K. Agarwal and Ke Yi are also supported by NSF under grants CCR-00-86013, EIA-01-31905, CCR-02-04118, and DEB-04-25465, by ARO grant DAAD19-03-1-0352, and by a grant from the U.S.–Israel Binational Science Foundation. Lars Arge is also supported by an Ole Rømer Scholarship from the Danish National Science Research Council.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG’06, June 5–7, 2006, Sedona, Arizona, USA.
Copyright 2006 ACM 1-59593-340-9/06/0006 ...$5.00.

Categories and Subject Descriptors: F.2.2 [Theory of Computation]: Nonnumerical Algorithms and Problems.

General Terms: Algorithms, Experimentation.

Keywords: Union-find, terrain analysis, contour trees, I/O-efficient algorithms.

1. INTRODUCTION

The union-find problem asks us to maintain a partition of a set U = {x_1, x_2, ...} (the universe), together with a representative element of each set in the partition, under a sequence Σ of UNION(x_i, x_j) and FIND(x_i) operations: UNION(x_i, x_j) joins the set containing x_i and the set containing x_j, and provides a new representative element for the new set; FIND(x_i) returns the representative element of the set containing x_i. In the on-line version of the problem, Σ is given one operation at a time, whereas in the batched (or off-line) version, the entire sequence Σ is known in advance.
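To make the UNION and FIND semantics concrete, the following is a minimal in-memory sketch in Python. It uses the classical union-by-size and path-compression heuristics and is not the I/O-efficient algorithm of this paper; the names `UnionFind` and `run_batch`, and the encoding of Σ as a list of tuples, are assumptions made only for illustration.

```python
class UnionFind:
    """Classical in-memory union-find (union by size, path compression)."""

    def __init__(self):
        self.parent = {}  # element -> parent element; roots point to themselves
        self.size = {}    # root -> number of elements in its set

    def find(self, x):
        # An element seen for the first time forms a singleton set.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # Walk up to the root, which serves as the representative.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx  # x and y are already in the same set
        # Union by size: attach the smaller tree below the larger one.
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return rx


def run_batch(ops):
    """Process a whole sequence of operations and return the FIND answers.

    Each operation is ("union", a, b) or ("find", a); this encoding is an
    assumption of this sketch, not notation from the paper.
    """
    uf = UnionFind()
    answers = []
    for op in ops:
        if op[0] == "union":
            uf.union(op[1], op[2])
        elif op[0] == "find":
            answers.append(uf.find(op[1]))
        else:
            raise ValueError(f"unknown operation: {op[0]!r}")
    return answers


sigma = [
    ("union", "x1", "x2"),
    ("find", "x2"),
    ("union", "x3", "x4"),
    ("union", "x2", "x3"),
    ("find", "x4"),
    ("find", "x5"),
]
answers = run_batch(sigma)
```

In this sketch the first two FIND operations return the same representative, because x2 and x4 end up in one set, while x5 remains its own representative. The sketch consumes Σ one operation at a time and so does not exploit the fact that the batched version knows the whole sequence in advance.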
The union-find problem is a fundamental algorithmic problem because of its applications in numerous problems across different domains, from programming languages to graph and geometric algorithms, and from computational topology to computational biology. In many of these algorithms only the batched version of the problem is required; see [11, 13, 16, 20] for a sample of applications.

The main motivation for our study of the union-find problem arises from terrain modeling and analysis. A terrain can be abstracted as a height function defined over $\mathbb{R}^2$, and there is a rich literature on the study of such functions. We are interested in two broad problems in terrain analysis, namely flow analysis and contour-line analysis. A key step in the flow analysis of a terrain is to modify the height function so that “small” depressions on the terrain (sinks) disappear. We use the notion of topological persistence, introduced in [16], to address this problem. In contour-line analysis, the notion of a contour tree is critical [11, 29, 34]. Most existing topological persistence and contour tree algorithms rely on efficient data structures for the batched union-find problem.

With the emergence of high-resolution terrain-mapping technologies, huge amounts of data are being generated that are too large to fit in memory and have to reside on disk. Existing algorithms cannot handle such massive data sets, mainly because they optimize CPU running time, whereas for data of this size optimizing disk access is much more important. Motivated by these factors, we propose efficient algorithms for the batched union-find problem in the I/O model [3] (also known as the external memory model), and use them to develop I/O-efficient algorithms for computing topological persistence and contour trees.

Related results. In the I/O model, the machine consists of an infinite-size external memory (disk) and a main memory of size M. A