Late Breaking Results: A Fast and Low-Cost Comparison-Free
Sorting Engine with Unary Computing
Amir Hossein Jalilvand*, Seyedeh Newsha Estiri*, Samaneh Naderi+, M. Hassan Najafi*, and Mohsen Imani†
*University of Louisiana at Lafayette, +Iran University of Science and Technology, †University of California Irvine
Corresponding Author: najaf@louisiana.edu
ABSTRACT
Hardware-efficient implementation of the sorting operation is crucial
for numerous applications, particularly when fast and energy-efficient
sorting of data is desired. Unary computing has been used for low-cost
hardware sorting. This work proposes a comparison-free unary sorting
engine that iteratively finds maximum values. Synthesis results show up
to 81% reduction in hardware area compared to the state-of-the-art
unary sorting design. By processing right-aligned unary bit-streams,
our unary sorter is able to sort many inputs in fewer clock cycles.
1 INTRODUCTION
Sorting is essential for numerous applications, including image
processing, artificial intelligence, task scheduling, and scientific
computing. For high-performance sorting, the operation is performed in
hardware with application-specific integrated circuits or
field-programmable gate arrays. Hardware-based sorting is fundamentally
different from software-based sorting such as QuickSort, MergeSort, and
BubbleSort. In software sorting, the order of comparisons depends on
the data, whereas in hardware sorting this order is fixed and
independent of the data. The number of sorting operations can vary
significantly from application to application. For example, in image
processing applications, thousands of inputs may need to be sorted.
Therefore, an optimal hardware implementation of the sorting operation
is of great importance.
There is a relatively large body of work on hardware-based sorting [3].
The ultimate goal is to sort data with minimum latency and hardware
cost. One of the most popular approaches is Batcher’s sorting [3].
Batcher wires up a network of compare-and-swap (CAS) units, which can
be pipelined easily. The hardware cost and the power consumption of
Batcher’s network depend on the number of CAS blocks and the cost of
each CAS block. Each CAS block compares two input values and swaps the
values at the output if needed. The total number of CAS blocks in an
$N$-input Batcher’s sorting network is
$N \times \log_2(N) \times (\log_2(N) + 1)/4$. Thus, 8-, 16-, 32-, and
256-input Batcher networks require 24, 80, 240, and 4,608 CAS blocks,
respectively [2].
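The CAS-count formula above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `cas` and `batcher_cas_count` are hypothetical and do not come from the cited works, and the count is assumed to apply to power-of-two input sizes.

```python
def cas(a, b):
    """Behavioral model of one compare-and-swap unit: returns (min, max)."""
    return (a, b) if a <= b else (b, a)


def batcher_cas_count(n):
    """CAS blocks in an n-input Batcher network: n * log2(n) * (log2(n) + 1) / 4.

    Assumes n is a power of two (illustrative restriction).
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    k = n.bit_length() - 1  # exact integer log2(n) for a power of two
    return n * k * (k + 1) // 4


# Reproduces the counts quoted in the text for 8, 16, 32, and 256 inputs.
print({n: batcher_cas_count(n) for n in (8, 16, 32, 256)})
# {8: 24, 16: 80, 32: 240, 256: 4608}
```

The integer logarithm is taken from `bit_length` rather than a floating-point `log2`, so the count stays exact for large networks.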
Batcher’s sorting is conventionally implemented based on the weighted
binary representation. Binary representation is compact; however,
computation on this representation is relatively complex. The
complexity grows with the data-width, which in turn affects the cost of
the hardware implementation, latency, power, and hence energy
consumption. Najafi et al. proposed an alternative low-cost hardware
design for Batcher’s networks using unary computing [5]. In unary
computing, numbers are encoded uniformly by a sequence of one value
(say 1) followed by a sequence of the other value (say 0), with the
data value determined by the fraction of 1’s in the sequence. For
example, 11000 is a left-aligned unary sequence (i.e., bit-stream)
representing 0.4. The minimum and the maximum value functions, the
essential functions in building Batcher sorting networks, can be
realized efficiently in the unary domain using simple bit-wise AND and
OR operations. An area and power saving of up to 92% is reported in [5]
for the unary Batcher sorting design compared to the conventional
binary counterpart.
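The unary encoding and the AND/OR realization of minimum and maximum can be illustrated in software. The sketch below is a behavioral model only, not the hardware of [5]; the helper names (`to_unary`, `from_unary`, `unary_min`, `unary_max`) are hypothetical, and both operands are assumed to share the same length and alignment.

```python
def to_unary(value, length, align="left"):
    """Encode a value in [0, 1] as a unary bit-stream of the given length."""
    ones = round(value * length)
    bits = [1] * ones + [0] * (length - ones)  # left-aligned: 1's first
    return bits if align == "left" else bits[::-1]


def from_unary(bits):
    """Decode a unary bit-stream: the value is the fraction of 1's."""
    return sum(bits) / len(bits)


def unary_min(a, b):
    """Bit-wise AND of two equally aligned unary streams yields the minimum."""
    return [x & y for x, y in zip(a, b)]


def unary_max(a, b):
    """Bit-wise OR of two equally aligned unary streams yields the maximum."""
    return [x | y for x, y in zip(a, b)]


a = to_unary(0.4, 5)  # [1, 1, 0, 0, 0], i.e. 11000 as in the text
b = to_unary(0.8, 5)  # [1, 1, 1, 1, 0]
print(from_unary(unary_min(a, b)), from_unary(unary_max(a, b)))  # 0.4 0.8
```

Because the 1's of both streams are contiguous and start at the same end, the shorter run of 1's is always contained in the longer one, which is why AND and OR recover the minimum and maximum.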
The hardware design of a comparison-free sorting engine is proposed in
[4]. Their design sorts $N$ data elements in nearly $N$ clock cycles
while recognizing the maximum number in the 1st clock cycle. Their
sorting engine is constructed by employing symmetric cascaded blocks,
and sorting operations are performed in a pipelined fashion. A
comparison-free sorting algorithm is also introduced in [1]. This
design can be applied to any data distribution with no significant
adjustment. The number of required cycles falls in the range of $2N$ to
$2N + 2^{K} - 1$, where $K$ is the bit-width of data and $N$ is the
number of input data.

Fig. 1: High-level Architecture of Comparison-Free Unary Sorter.
[Figure labels: In[n], Engine, Multiplexer, Address of largest element,
Detection Signal (ds), Data, Address, Out[1], Out[2], ..., Out[n]
(Sorted data).]
This work proposes a fast and low-cost comparison-free sorting
architecture based on unary computing. We iteratively find the index of
the maximum value by converting data to right-aligned unary bit-streams
and finding the first "1" in the generated bit-streams. Our synthesis
results show a significant area reduction, up to 81%, compared to the
state-of-the-art unary sorting design of [5] and up to 45% compared to
the comparison-free design of [4]. The proposed sorter sorts many
inputs in fewer clock cycles compared to the unary design of [5].
2 COMPARISON-FREE UNARY SORTER
Here we describe our proposed comparison-free unary sorting design. The
high-level architecture is shown in Fig. 1. The architecture includes a
sorting engine, a controller, and a multiplexer. The design reads
unsorted data from the input registers and performs sorting by finding
the address of the maximum number at each step. Fig. 2 shows the
proposed sorting engine. In the first step, the sorting engine converts
data to right-aligned unary bit-streams and returns the index of the
bit-stream corresponding to the maximum value. This is done by finding
the bit-stream that produces the first 1. Consider a set of six inputs,
In[1] = 0.4, In[2] = 0.2, In[3] = 0.8, In[4] = 0.6, In[5] = 0.2, and
In[6] = 0.8. A right-aligned unary representation for these numbers is
In[1] = 00011, In[2] = 00001, In[3] = 01111, In[4] = 00111,
In[5] = 00001, and In[6] = 01111. In the first cycle, the shared down
counter starts counting down and a zero bit is generated for all
inputs. In the second cycle, a one is generated for the third (In[3])
and the last (In[6]) inputs. This enables the flip-flops corresponding
to the third and last inputs. When these flip-flops are activated, the
detection signal (ds), which is the output of an addition unit, will
have a value of two. ds = 2 means that the next maximum value is held
not by a single input but by two inputs with the same value. We utilize
a priority encoder to obtain the memory address of one of the maximum
values in the second cycle. Next, ds is passed to the controller. The
controller’s finite state machine is shown in Fig. 3.
When ds = 2, the state changes from "Find the index" to "Put the
results." The state does not change until the two numbers are in
DAC '22, July 10–14, 2022, San Francisco, CA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9142-9/22/07…$15.00
https://doi.org/10.1145/3489517.3530615