Late Breaking Results: A Fast and Low-Cost Comparison-Free Sorting Engine with Unary Computing

Amir Hossein Jalilvand*, Seyedeh Newsha Estiri*, Samaneh Naderi+, M. Hassan Najafi*, and Mohsen Imani†
*University of Louisiana at Lafayette, +Iran University of Science and Technology, †University of California Irvine
Corresponding Author: najafi@louisiana.edu

ABSTRACT
Hardware-efficient implementation of the sorting operation is crucial for numerous applications, particularly when fast and energy-efficient sorting of data is desired. Unary computing has been used for low-cost hardware sorting. This work proposes a comparison-free unary sorting engine that iteratively finds maximum values. Synthesis results show up to 81% reduction in hardware area compared to the state-of-the-art unary sorting design. By processing right-aligned unary bit-streams, our unary sorter is able to sort many inputs in fewer clock cycles.

1 INTRODUCTION
Sorting is essential for numerous applications, including image processing, artificial intelligence, task scheduling, and scientific computing. For high-performance sorting, sorting is performed in hardware with application-specific integrated circuits or field-programmable gate arrays. Hardware-based sorting is fundamentally different from software-based sorting such as QuickSort, MergeSort, and BubbleSort. In software sorting, the order of comparisons depends on the data. In hardware sorting, however, this order is fixed and independent of the data. The number of sorting operations can vary significantly from application to application. For example, in image processing applications, thousands of inputs may need to be sorted. Therefore, an optimal hardware implementation of the sorting operation is of great importance.

There is a relatively large body of work on hardware-based sorting [3]. The ultimate goal is to sort data with minimum latency and hardware cost. One of the most popular approaches is Batcher’s sorting [3].
Batcher wires up a network of compare-and-swap (CAS) units, which can be pipelined easily. The hardware cost and the power consumption of Batcher’s network depend on the number of CAS blocks and the cost of each CAS block. Each CAS block compares two input values and swaps the values at the output if needed. The total number of CAS blocks in an $n$-input Batcher sorting network is $n \times \log_2(n) \times (\log_2(n) + 1)/4$. Thus, 8-, 16-, 32-, and 256-input Batcher networks require 24, 80, 240, and 4,608 CAS blocks, respectively [2]. Batcher’s sorting is conventionally implemented based on the weighted binary representation. Binary representation is compact; however, computation on this representation is relatively complex. The complexity increases with the data-width, which affects the cost of hardware implementation, latency, power, and hence, energy consumption.

Najafi et al. proposed an alternative low-cost hardware design for Batcher’s networks using unary computing [5]. In unary computing, numbers are encoded uniformly by a sequence of one value (say 1) followed by a sequence of the other value (say 0), with the data value determined by the fraction of 1’s in the sequence. For example, 11000 is a left-aligned unary sequence (i.e., bit-stream) representing 0.4. The minimum and the maximum value functions, the essential functions in building Batcher sorting networks, can be realized efficiently in the unary domain using simple bit-wise AND and OR operations. An area and power saving of up to 92% is reported in [5] for the unary Batcher sorting design compared to the conventional binary counterpart.

The hardware design of a comparison-free sorting engine is proposed in [4]. Their design sorts $n$ data elements in nearly $n$ clock cycles while recognizing the maximum number in the first clock cycle. Their sorting engine is constructed by employing $n$ symmetric cascaded blocks, and sorting operations are performed in a pipelined fashion. A comparison-free sorting algorithm is also introduced in [1]. This design can be applied to any data distribution with no significant adjustment. The number of required cycles falls in the range of $2n$ to $2n + 2^k - 1$, where $k$ is the bit-width of data and $n$ is the number of input data.

This work proposes a fast and low-cost comparison-free sorting architecture based on unary computing. We iteratively find the index of the maximum value by converting data to right-aligned unary bit-streams and finding the first "1" in the generated bit-streams. Our synthesis results show a significant area reduction, up to 81%, compared to the state-of-the-art unary sorting design of [5] and up to 45% compared to the comparison-free design of [4]. The proposed sorter sorts many inputs in fewer clock cycles compared to the unary design of [5].

Fig. 1: High-level Architecture of Comparison-Free Unary Sorter.

2 COMPARISON-FREE UNARY SORTER
Here we describe our proposed comparison-free unary sorting design. The high-level architecture is shown in Fig. 1. The architecture includes a sorting engine, a controller, and a multiplexer. The design reads unsorted data from the input registers and performs sorting by finding the address of the maximum number at each step. Fig. 2 shows the proposed sorting engine. In the first step, the sorting engine converts data to right-aligned unary bit-streams and returns the index of the bit-stream corresponding to the maximum value. This is done by finding the bit-stream that produces the first 1. Consider a set of inputs, $x_1 = 0.4$, $x_2 = 0.2$, $x_3 = 0.8$, $x_4 = 0.6$, $x_5 = 0.2$, and $x_6 = 0.8$. A right-aligned unary representation for these numbers is $x_1 = 00011$, $x_2 = 00001$, $x_3 = 01111$, $x_4 = 00111$, $x_5 = 00001$, and $x_6 = 01111$.
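The conversion in this example can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the function names, the quantization of each value to a number of ones out of a 5-bit stream, and the rule that a stream emits 1 once its level exceeds a shared down counter are all assumptions made for the sketch.

```python
from typing import List, Tuple


def quantize(value: float, length: int) -> int:
    """Number of ones in a unary stream of `length` bits (assumed rounding)."""
    return round(value * length)


def right_aligned_stream(level: int, length: int) -> str:
    """Right-aligned unary bit-stream: zeros first, then `level` ones."""
    if not 0 <= level <= length:
        raise ValueError("level must lie between 0 and length")
    return "0" * (length - level) + "1" * level


def first_one(levels: List[int], length: int) -> Tuple[int, List[int]]:
    """Scan all streams one bit per cycle.

    Returns (cycle, indices): the 1-based cycle in which a 1 first appears
    and the 0-based indices of the streams that emit it. Returns (0, [])
    when every input is zero.
    """
    counter = length - 1  # hypothetical shared down counter
    for cycle in range(1, length + 1):
        # Assumed generator rule: a stream emits 1 once its level
        # exceeds the current counter value.
        hits = [i for i, level in enumerate(levels) if level > counter]
        if hits:
            return cycle, hits
        counter -= 1
    return 0, []


values = [0.4, 0.2, 0.8, 0.6, 0.2, 0.8]
levels = [quantize(v, 5) for v in values]
streams = [right_aligned_stream(level, 5) for level in levels]
cycle, maxima = first_one(levels, 5)
```

For the six inputs above, `streams` reproduces the bit-streams listed in the text, and `first_one` reports cycle 2 with indices 2 and 5, that is, the third and sixth inputs, which hold the maximum value 0.8.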
In the first cycle, the shared down counter starts counting down and a zero bit is generated for all inputs. In the second cycle, a one is generated for the third ($x_3$) and the last ($x_6$) input. This enables the flip-flops corresponding to the third and last inputs. When these flip-flops are activated, the detection signal (ds), which is the output of an addition unit, has a value of two. ds = 2 means that the next maximum is not a single number but two numbers with the same value. We utilize a priority encoder to obtain the memory address of one of the maximum values in the second cycle. Next, ds is passed to the controller. The controller’s finite state machine is shown in Fig. 3. When ds = 2, the state changes from "Find the index" to "Put the results."

DAC '22, July 10–14, 2022, San Francisco, CA, USA
© 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9142-9/22/07…$15.00
https://doi.org/10.1145/3489517.3530615

The state does not change until the two numbers are in