Engineering a Topological Sorting Algorithm for Massive Graphs

Deepak Ajwani * Adan Cosgaya-Lozano Norbert Zeh

Abstract

We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs (DAGs). No provably I/O-efficient algorithm for this problem is known. Similarly, the performance of our algorithm, which we call IterTS, may be poor in the worst case. However, our experiments show that IterTS achieves good performance in practice. The strategy of IterTS can be summarized as follows. We call an edge satisfied if its tail has a smaller number than its head. A numbering satisfying at least half the edges in the DAG is easy to find: a random numbering is expected to have this property. IterTS starts with such a numbering and then iteratively corrects the numbering to satisfy more and more edges until all edges are satisfied. To evaluate IterTS, we compared its running time to those of three competitors: PeelTS, an I/O-efficient implementation of the standard strategy of iteratively removing sources and sinks; ReachTS, an I/O-efficient implementation of a recent parallel divide-and-conquer algorithm based on reachability queries; and SeTS, standard DFS-based topological sorting built on top of a semi-external DFS algorithm. In our evaluation on various types of input graphs, IterTS consistently outperformed PeelTS and ReachTS, by at least an order of magnitude in most cases. SeTS outperformed IterTS on most graphs whose vertex sets fit in memory. However, IterTS often came close to the running time of SeTS on these inputs and, more importantly, SeTS was not able to process graphs whose vertex sets were beyond the size of main memory, while IterTS was able to process such inputs efficiently.

* MADALGO Center for Massive Data Algorithmics, Department of Computer Science, Aarhus University, Aarhus, Denmark. Email: ajwani@madalgo.au.dk. Research was supported in part by the Danish National Research Foundation.
Travel to the conference was supported by IRCSET/IBM.

Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada. Email: acosgaya@cs.dal.ca. Supported in part by NSERC.

Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada. Email: nzeh@cs.dal.ca. This research was supported in part by NSERC and the Canada Research Chairs programme.

1 Introduction

Let G = (V, E) be a directed acyclic graph (DAG) with n := |V| vertices and m := |E| edges. Topological sorting is the problem of finding a linear ordering of the vertices in V such that the tail of each edge in E precedes its head in the ordering. Linear-time algorithms for this problem are covered in standard undergraduate texts, as topological sorting captures the problem of finding a linear order of items or activities consistent with a set of pairwise ordering constraints, which arises in a number of applications.

The problem of topologically sorting large DAGs arises, for example, in the application of recent multiple sequence alignment algorithms [20, 21] to large collections of DNA sequences. Topologically sorting large DAGs is also an important building block for other I/O-efficient algorithms, mostly due to a technique called time-forward processing [9], which has proven useful in obtaining I/O-efficient solutions to a number of problems but requires the vertices of the graph to be given in topologically sorted order. Time-forward processing solves the following “graph evaluation” problem: given a DAG each of whose vertices has a label φ(x), process its vertices in topologically sorted order and, for each vertex x, compute a new label ψ(x) from φ(x) and the ψ-labels of x’s in-neighbours. A simple example of this type of problem is the evaluation of a Boolean circuit represented as a DAG: φ(·) assigns a Boolean function to each vertex, turning it into a logical gate; ψ(x) is the output of the gate represented by vertex x, given the inputs it receives from its in-neighbours.
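To make the graph evaluation problem concrete, the following is a minimal in-memory sketch in Python that evaluates a small Boolean circuit. It is an illustration under stated assumptions only: the function name evaluate_dag, the dictionary-based graph representation, and the example circuit are hypothetical and not taken from this paper, and the sketch does not model the external-memory aspects of time-forward processing. The topological order is obtained from Python's standard graphlib module.

```python
from graphlib import TopologicalSorter


def evaluate_dag(in_neighbours, phi):
    """Compute psi(x) for every vertex x of a DAG.

    in_neighbours: dict mapping each vertex to a list of its in-neighbours
                   (hypothetical representation chosen for this sketch).
    phi:           dict mapping each vertex to a function that takes the list
                   of psi-labels of the vertex's in-neighbours.
    """
    # graphlib expects a mapping from node to predecessors and yields
    # every node after all of its predecessors.
    order = TopologicalSorter(in_neighbours).static_order()
    psi = {}
    for x in order:
        # All in-neighbours of x were processed earlier in the order,
        # so their psi-labels are already available.
        inputs = [psi[u] for u in in_neighbours.get(x, [])]
        psi[x] = phi[x](inputs)
    return psi


# Hypothetical circuit: not(or(a, and(a, b))) with a = True, b = False.
in_neighbours = {
    "a": [],
    "b": [],
    "and": ["a", "b"],
    "or": ["a", "and"],
    "not": ["or"],
}
phi = {
    "a": lambda ins: True,          # constant input gate
    "b": lambda ins: False,         # constant input gate
    "and": lambda ins: all(ins),
    "or": lambda ins: any(ins),
    "not": lambda ins: not ins[0],
}

psi = evaluate_dag(in_neighbours, phi)
print(psi["not"])
```

For this circuit, the gates "and", "or", and "not" receive the ψ-labels False, True, and False, respectively. The sketch works only because a topological order is available before evaluation starts, which is exactly the requirement discussed in the surrounding text.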
Since time-forward processing requires the vertices of the DAG to be given in topologically sorted order and no general I/O-efficient topological sorting algorithm is known to date, time-forward processing has been applied only in situations where a topological ordering of the vertices can be obtained by using secondary information about the structure of the DAG (e.g., [3, 4, 13, 15]). A general topological sorting algorithm for massive graphs would greatly increase the applicability of this technique.

Copyright © 2011 by SIAM. Unauthorized reproduction is prohibited.

Two simple linear-time algorithms for topological sorting are to repeatedly number and remove sources (in-degree-0 vertices) or to perform a depth-first search (DFS) of the graph and number the vertices in reverse