238 IEEE TRANSACTIONS ON COMPUTERS, VOL. 38, NO. 2, FEBRUARY 1989

Parallel Sorting in Two-Dimensional VLSI Models of Computation

Abstract: Shear-sort opened new avenues in the research of sorting techniques for mesh-connected processor arrays. The algorithm is extremely simple and converges to a snake-like sorted sequence with a time complexity that is suboptimal by a logarithmic factor. The techniques used for analyzing shear-sort have been used to derive more efficient algorithms, which have important ramifications from both practical and theoretical viewpoints. Although the algorithms described apply to any general two-dimensional computational model, the focus of most discussions is on mesh-connected computers, which are now commercially available. In spite of a rich history of O(n) sorting algorithms on an n × n SIMD mesh, the constants associated with the leading term (i.e., n) are fairly large. This has led researchers to speculate about the tightness of the lower bound. The work in this paper sheds more light on this problem: a 4n-step algorithm is shown to exist for a model slightly more powerful than the conventional SIMD model. Moreover, this algorithm has a running time of 3n steps on the more powerful MIMD model, which is “truly” optimal for such a model.

Index Terms: Distance bound, lower bound, mesh-connected network, parallel algorithm, sorting, time complexity, upper bound.

I. INTRODUCTION

TWO-DIMENSIONAL sorting is defined as the ordering of a rectangular array of numbers such that every element is routed to a distinct position of the array predetermined by some indexing scheme. Some of the standard indexing schemes are illustrated in Fig. 1. The simplest computational model onto which this problem can be mapped is the mesh-connected processor array (mesh for short).
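The indexing schemes named in Fig. 1 can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper: the function names are hypothetical, positions are zero-based (row i, column j), and the shuffled row-major scheme is assumed to follow the common bit-interleaving definition on an n × n array with n a power of two.

```python
def row_major(i, j, n):
    """Rank of position (i, j): rows are numbered left to right, top to bottom."""
    return i * n + j


def snake_row_major(i, j, n):
    """Like row major, but odd-numbered rows run right to left."""
    return i * n + (j if i % 2 == 0 else n - 1 - j)


def shuffled_row_major(i, j, n):
    """Interleave the bits of i and j (assumed definition; n must be a power of two).

    Bit b of j goes to bit 2b of the rank, bit b of i goes to bit 2b + 1.
    """
    bits = n.bit_length() - 1  # log2(n) when n is a power of two
    rank = 0
    for b in range(bits):
        rank |= ((j >> b) & 1) << (2 * b)
        rank |= ((i >> b) & 1) << (2 * b + 1)
    return rank


def index_grid(scheme, n):
    """Return the n x n table of ranks assigned by an indexing scheme."""
    return [[scheme(i, j, n) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for scheme in (row_major, snake_row_major, shuffled_row_major):
        print(scheme.__name__)
        for row in index_grid(scheme, 4):
            print(" ".join(f"{r:2d}" for r in row))
```

Sorting with respect to a scheme then means that the element of rank k in sorted order ends up at the position whose table entry is k. Under the snake-like scheme, consecutive ranks always sit in neighboring processors, which is why it suits nearest-neighbor meshes.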
The simplicity of the interconnection pattern and the locality of communication make the mesh easy to build and program, and the mesh was the basis of one of the earliest parallel computers (ILLIAC IV). Since then, more machines have been built on a much larger scale, including the MPP and the DAPP, using similar interconnection patterns. This simple architecture further motivates the idea of dealing with a given set of numbers as a rectangular array rather than as a linear sequence. More recently, Scherson [15] and Tseng et al. [22] have independently proposed a network which they call the orthogonal access architecture and the reduced-mesh network, respectively. It consists of p processors which are connected by a shared memory of p·q × p·q locations, where each processor can randomly access a row or a column of size q independently.

Manuscript received August 29, 1986; revised February 15, 1988.
I. D. Scherson is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544.
S. Sen is with the Department of Computer Science, Duke University, Durham, NC 27706.
IEEE Log Number 8824537.

Fig. 1. Some indexing schemes. (a) Row major, (b) snake-like row major, (c) shuffled row major.

The sequential complexity of sorting has been studied extensively for well over two decades (for a very interesting account of the history of the development of various sorting methods, see [4]), but only recently has the complexity of parallel sorting received much attention. Although several interesting results have been obtained for various PRAM models (for example, see Reischuk [12] and Cole [2]), the existence of a practical O(n)-processor, O(log n)-depth sorting network remains unresolved. The O(log n)-depth AKS network [1] and the subsequent improvement in the processor bound by Leighton [8] are primarily of theoretical importance.
A trivial lower bound for sorting on any network is imposed by the diameter of the network, which implies an Ω(n) time complexity for sorting on an n × n mesh.¹ The restriction on parallelism imposed by the nearest-neighbor interconnection results in inferior performance of algorithms on the mesh compared with networks like the shuffle-exchange and the hypercube, which have smaller diameters due to more complicated interconnection patterns. This is the price one pays for the simplicity of the network interconnections, which may be worthwhile for a large number of applications. In particular, the recent interest in systolic implementations is based on a family of nearest-neighbor interconnections, of which the mesh-connected processor array is one of the simplest assemblages.

¹ In future, all references to “mesh” will imply an n × n array of processors. All logarithms are to the base 2.

0018-9340/89/0200-0238$01.00 © 1989 IEEE

One of the earliest results for sorting on rectangular arrays of numbers was published by Thompson and Kung [20]. The