Trident: Technology-Scalable Architecture for Data Parallel Applications

Stanislav G. Sedukhin and Mostafa I. Soliman
Graduate School of Computer Science and Engineering
The University of Aizu, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
(sedukhin; d8031102)@u-aizu.ac.jp

Abstract

Within the current decade, process technology promises more than one billion transistors on a single die, operating at frequencies above 10 GHz. We propose the Trident processor, which uses a multi-level ISA to express data parallelism to hardware. Trident is scalable because its architecture is regular and can be widely replicated to efficiently harness the available transistor budget. It also relies on local communication, which suits the high operating frequencies of future VLSI technology. This paper discusses the Trident processor architecture and evaluates its performance on the Basic Linear Algebra Subprograms (BLAS), which are widely used in many data parallel applications. Performance is evaluated with the TFLOPS rate on infinite-size problems (R_∞), which is primarily a characteristic of the computer technology, and the problem size needed to reach one-half of R_∞ (N_1/2), which is a measure of the amount of parallelism in a computer architecture. On 128 parallel Trident lanes at a 10 GHz operating frequency, both plausible in the billion-transistor era, R_∞ for dot-product, matrix-vector, and matrix-matrix multiplication is 1.1, 1.8, and 2.5 TFLOPS, respectively. In addition, N_1/2 increases when moving from lower to higher levels of BLAS.

1. Introduction

Although some questions have arisen about the continued validity of Moore's law, which states that single-chip transistor counts double roughly every 18 months, the underlying semiconductor technology continues to improve significantly.
Within the current decade, process technology promises more than one billion transistors on a single die, operating at frequencies above 10 GHz [1, 2]. As a direct result of the fundamental trends of increasing transistor density and switching speed, new technological and microarchitectural design constraints arise. These constraints include the interconnect problem (wires scale more slowly than logic) [3], design and verification complexity (driven by the growing capacity and functional complexity of processors) [4], and the so-called "memory wall" (the widening performance gap between processor and memory) [5]. We propose an approach that efficiently harnesses the available transistor budget and speeds up a wide variety of data parallel applications. We use a multi-level instruction set architecture (ISA) to express parallelism to hardware, instead of extracting it dynamically with complicated logic (superscalar architectures) or statically with compilers (VLIW architectures). Since the fundamental data structures of a wide variety of multimedia, scientific, and engineering applications are scalars, vectors, and matrices, our Trident research processor has a three-level ISA that provides a high-level interface for programming and for expressing parallelism to hardware. This leads to high performance, a simple programming model, and compact executable code. The Trident processor emphasizes local communication to exploit the benefits of future VLSI technology. Like vector architectures [6], the Trident processor extends a scalar core with parallel lanes; each lane contains an execution datapath and a slice of the register file. However, the Trident processor can effectively process not only vector but also matrix data on the parallel lanes [7]. Owing to the natural scalability of vector and matrix processing, the Trident processor scales easily.
Only a single lane needs to be designed and verified; replicating that lane then scales the Trident processor to process longer vectors or larger matrices [8]. This reduces design and verification complexity. Because of the wide use of the Basic Linear Algebra Subprograms (BLAS) in many data parallel applications, this paper presents their implementation and evaluation on the Trident processor. Level 1 [9], Level 2 [10], and Level 3 [11] BLAS define subroutines that perform basic vector-vector, matrix-vector, and matrix-matrix operations, respectively. Level 1 BLAS has been very successful and

0-7695-1926-1/03/$17.00 (C) 2003 IEEE