Computing the drainage network on huge grid terrains

Thiago L. Gomes, Universidade Fed. de Viçosa, Viçosa, MG, Brazil, thiago.luange@ufv.br
Salles V. G. Magalhães, Universidade Fed. de Viçosa, Viçosa, MG, Brazil, salles@ufv.br
Marcus V. A. Andrade, Universidade Fed. de Viçosa, Viçosa, MG, Brazil, marcus@ufv.br
W. Randolph Franklin, Rensselaer Polytechnic Inst., Troy, NY, USA, mail@wrfranklin.org
Guilherme C. Pena, Universidade Fed. de Viçosa, Viçosa, MG, Brazil, guilherme.pena@ufv.br

ABSTRACT
We present a very efficient algorithm, named EMFlow, and its implementation to compute the drainage network, that is, the flow direction and flow accumulation, on huge terrains stored in external memory. It is about 20 times faster than the two most recent and most efficient published methods: TerraFlow and r.watershed.seg. Since processing large datasets can take hours, this improvement is very significant.
EMFlow is based on our previous method, RWFlood, which uses a flooding process to compute the drainage network. To reduce the total number of I/O operations, EMFlow groups the terrain cells into blocks, which are stored in a special data structure managed as a cache memory. In addition, a new strategy is adopted to subdivide the terrain into islands, which are processed separately.
Because of the recent increase in the volume of high-resolution terrestrial data, internal memory algorithms do not run well on most computers; thus, optimizing massive data processing algorithms simultaneously for data movement and computation has been a challenge for GIS.

Categories and Subject Descriptors
F.2.2 [Nonnumerical Algorithms and Problems]: Geometrical problems and computations

General Terms
Algorithms, Experimentation, Performance

Keywords
Terrain modeling, GIS, External memory processing, Hydrology

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGSPATIAL GIS ’12, November 6-9, 2012. Redondo Beach, CA, USA
Copyright (c) 2012 ACM ISBN 978-1-4503-1691-0/12/11...$15.00.

1. INTRODUCTION
Many important applications in Geographical Information Science (GIS), such as hydrology, visibility, and routing, require terrain data processing, and these applications have become a challenge for GIS because they have to process a huge volume of high-resolution terrestrial data. On most computers, internal memory algorithms do not run well on such volumes of data, since a large number of I/O operations is necessary. For example, NASA’s Shuttle Radar Topography Mission (SRTM) acquired 30-meter-resolution terrain data for much of the world, generating about 10 terabytes of data. Datasets can be even bigger considering the technological advances that allow data acquisition at sub-meter resolution.
Thus, it is important to optimize massive data processing algorithms simultaneously for computation and for data movement between external and internal memory, since processing data in external memory takes much more time. That is, algorithms for external memory processing must be designed and implemented to minimize the number of I/O operations needed to swap data between main memory and disk.
More precisely, algorithms for external memory processing should be designed and analyzed under a computational model in which algorithm complexity is evaluated based on data transfer operations instead of CPU processing operations.
A model often used, proposed by Aggarwal and Vitter [1], defines an I/O operation as the transfer of one disk block of size B between the external and internal memory; performance is measured by the number of such I/O operations. The internal computation time is assumed to be comparatively insignificant. Algorithm complexity is expressed in terms of the number of I/O operations executed by fundamental operations such as scanning or sorting n contiguous elements stored in external memory. These are $\mathrm{scan}(n) = \Theta(n/B)$ and $\mathrm{sort}(n) = \Theta\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right)$, where M is the internal memory size.
Hydrological applications generally require computing the drainage network of a terrain, which consists of the flow direction and the flow accumulation. Intuitively, these are, respectively, the path that water follows through the terrain and the amount of water that flows into each terrain cell, supposing that each cell receives a rain drop [12]. As broadly described [2, 4, 10, 11], this is a very time-consuming process, mainly on huge terrains that require external memory processing. Indeed, in many situations the flow direction cannot be determined straightforwardly, as, for example, in a local minimum terrain cell.
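To make the two components of the drainage network concrete, the following is a minimal internal-memory sketch in Python. It is not EMFlow or RWFlood. It assumes a simple steepest-descent rule over the eight neighbors of each cell, leaves local minima and flat cells without a direction (the difficult case noted above), and counts each cell's own rain drop in its accumulation. The function names `flow_direction` and `flow_accumulation` and the sample terrain are illustrative only.

```python
import math
from typing import List, Optional, Tuple

Cell = Tuple[int, int]

# Offsets (row, col) of the eight neighbors of a grid cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_direction(elev: List[List[float]]) -> List[List[Optional[Cell]]]:
    """Return, for each cell, the neighbor of steepest descent.

    Cells with no strictly lower neighbor (local minima, flats)
    get None: this sketch does not resolve them.
    """
    rows, cols = len(elev), len(elev[0])
    direction: List[List[Optional[Cell]]] = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best_slope = 0.0
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Diagonal neighbors are sqrt(2) cell widths away.
                    slope = (elev[r][c] - elev[nr][nc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best_slope = slope
                        direction[r][c] = (nr, nc)
    return direction


def flow_accumulation(elev: List[List[float]],
                      direction: List[List[Optional[Cell]]]) -> List[List[int]]:
    """Count the rain drops reaching each cell, one drop falling per cell."""
    rows, cols = len(elev), len(elev[0])
    acc = [[1] * cols for _ in range(rows)]  # each cell's own drop
    # Water only moves to strictly lower cells, so visiting cells from
    # highest to lowest finalizes a cell before its flow is passed on.
    cells = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: elev[rc[0]][rc[1]], reverse=True)
    for r, c in cells:
        target = direction[r][c]
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc


# Hypothetical 3x3 terrain sloping toward the bottom-right corner.
terrain = [[9, 8, 7],
           [8, 5, 4],
           [7, 4, 1]]
directions = flow_direction(terrain)
print(flow_accumulation(terrain, directions))
# → [[1, 1, 1], [1, 4, 2], [1, 2, 9]]
```

In this example all nine drops reach the lowest cell. The sketch keeps the whole grid in memory and visits neighbors in arbitrary positions, which is exactly the access pattern that causes many I/O operations once the terrain no longer fits in internal memory.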