Scalability Investigation of Mat-Core Processor
Mostafa I. Soliman
Computers & Systems Section, Electrical Engineering Dep.,
Aswan Faculty of Engineering, South Valley University,
Aswan 81542, Egypt
Abdulmajid F. Al-Junaid
Computers & Systems Section, Electrical Engineering Dep.,
Faculty of Engineering, Assiut University,
Assiut 71515, Egypt
Abstract—Mat-Core is a research processor that aims to
exploit the increasing number of transistors per IC to
improve the performance of a wide range of applications. It
extends a general-purpose scalar processor with a matrix unit
for processing vector/matrix data. The extended matrix unit is
decoupled into two components to hide memory latency:
address generation and data computation, which communicate
through data queues. This paper investigates the scalability of
the Mat-Core architecture with different numbers of parallel lanes
(one, four, and eight) on several linear algebra kernels. These
kernels include scalar-vector multiplication, SAXPY, Givens
rotation, rank-1 update, vector-matrix multiplication, and
matrix-matrix multiplication. A cycle-accurate model of the
Mat-Core processor is implemented in SystemC (a system-level
modeling language).
Four versions of the Mat-Core processor are implemented and
evaluated to show its scalability. These versions include
Mat-Core with a single lane and 8-element vector registers, four lanes
with 4×4 matrix registers, four lanes with 8×4 matrix registers,
and eight lanes with 8×8 matrix registers. The first version
(single lane with 8-element vector registers) exploits only the scalar
and vector ISA, whereas the other versions can exploit all three
levels of the Mat-Core ISA (scalar/vector/matrix ISA). Our results
show that increasing the number of parallel lanes from one to
four and then to eight speeds up the execution of the six kernels
by factors of 3.6x-4.8x and 7.94x-10.6x, respectively, which
indicates the scalability of the Mat-Core architecture. Moreover,
the maximum performance of the Mat-Core processor on
matrix-matrix multiplication reaches 90% of the ideal value.
Index Terms— scalable architecture, high performance
computing, performance evaluation, vector/matrix processing.
I. INTRODUCTION
Scalability is a major challenge for
microprocessor designers. Architectural scalability simply
means that a very large computer can be built from a large
number of basic components (computers, processors or
processing elements, memories, and switches) with no single
bottleneck component. Thus, the computer can be
expanded incrementally over its designed scaling range,
delivering linearly increasing performance for a well-defined
set of applications. This paper investigates the scalability of
our proposed Mat-Core architecture with different numbers of
parallel lanes (one, four, and eight) on several linear algebra
kernels: scalar-vector multiplication, SAXPY (single-precision
scalar A times vector X plus vector Y), Givens
rotation, rank-1 update, vector-matrix multiplication, and
matrix-matrix multiplication. A cycle-accurate model of the
Mat-Core processor is implemented in SystemC, a system-level
modeling language [1].
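For reference, the six kernels can be written as plain scalar loops. The following is a minimal Python sketch of their mathematical definitions only; it is not the Mat-Core implementation, the function names are illustrative, and the sign convention for the Givens rotation (x' = c·x + s·y, y' = c·y − s·x) is an assumption borrowed from the BLAS routine SROT.

```python
# Reference definitions of the six kernels (illustrative names, plain Python lists).

def scalar_vector_mul(a, x):
    """y = a * x"""
    return [a * xi for xi in x]

def saxpy(a, x, y):
    """y = a * x + y"""
    return [a * xi + yi for xi, yi in zip(x, y)]

def givens_rotation(c, s, x, y):
    """Apply a plane rotation to the vector pair (x, y).
    Assumed convention: x' = c*x + s*y, y' = c*y - s*x."""
    x_rot = [c * xi + s * yi for xi, yi in zip(x, y)]
    y_rot = [c * yi - s * xi for xi, yi in zip(x, y)]
    return x_rot, y_rot

def rank1_update(A, x, y):
    """A = A + x * y^T"""
    return [[A[i][j] + x[i] * y[j] for j in range(len(y))]
            for i in range(len(x))]

def vector_matrix_mul(x, A):
    """y^T = x^T * A"""
    return [sum(x[i] * A[i][j] for i in range(len(x)))
            for j in range(len(A[0]))]

def matrix_matrix_mul(A, B):
    """C = A * B"""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]
```

The first three kernels perform O(n) operations on O(n) data, rank-1 update and vector-matrix multiplication perform O(n²) operations on O(n²) data, and matrix-matrix multiplication performs O(n³) operations on O(n²) data, so the set spans a range of data-reuse behavior.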
Mat-Core is a research processor that aims to exploit the
increasing number of transistors per IC to improve the
performance of a wide range of applications. It extends a
general-purpose scalar processor with a matrix unit for
processing vector/matrix data. The extended matrix unit is
decoupled into two components to hide memory latency:
data computation and address generation, which
communicate through data queues [2]. As in vector
processors [3]-[7], the data computation unit is organized in
parallel lanes. However, these parallel lanes can process not only
vector data but also matrix data. Hence, the Mat-Core
processor inherits from vector processor design a
relatively straightforward means of scaling performance. By
increasing the number of parallel lanes, designers can easily
increase the amount of data-level parallelism exploited. This
also allows designers to scale the processor design to
exploit the number of transistors, which continues to
grow according to Moore’s law [8].
Four versions of the Mat-Core processor are implemented and
evaluated to show its scalability. These versions differ
in the number of parallel lanes and the size of the registers in the
matrix unit of the Mat-Core architecture. The first version
contains one lane with eight-element vector registers. It
exploits only the scalar and vector ISA, whereas the
other versions can exploit all three levels of the Mat-Core ISA
(scalar/vector/matrix ISA) [9]. The second and third versions
contain four lanes but differ in the size of their matrix
registers (4×4 and 8×4). These versions show that, for the same
number of parallel lanes, scaling the matrix register size
improves the performance of Mat-Core. This is
because a larger matrix register amortizes the pipeline
latency of the functional units. The last version has eight lanes
with 8×8 matrix registers.
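The amortization argument can be illustrated with a deliberately simplified timing model. The sketch below assumes that one instruction over E register elements on L lanes occupies a functional unit for ceil(E/L) cycles after a fixed pipeline latency of P cycles, and that consecutive instructions do not overlap. The function names and the latency value of 4 cycles are hypothetical and are not taken from the Mat-Core SystemC model.

```python
from math import ceil

def instruction_cycles(elements, lanes, pipeline_latency):
    """Cycles for one non-overlapped instruction: fill latency plus
    one cycle per group of `lanes` elements (simplifying assumption)."""
    return pipeline_latency + ceil(elements / lanes)

def utilization(elements, lanes, pipeline_latency):
    """Fraction of the instruction's cycles in which the lanes do useful work."""
    busy = ceil(elements / lanes)
    return busy / instruction_cycles(elements, lanes, pipeline_latency)

if __name__ == "__main__":
    P = 4  # hypothetical pipeline latency in cycles
    for rows, cols in [(4, 4), (8, 4)]:
        u = utilization(rows * cols, lanes=4, pipeline_latency=P)
        print(f"{rows}x{cols} register on 4 lanes: utilization {u:.2f}")
```

Under these assumptions, doubling the register from 4×4 to 8×4 on four lanes doubles the useful cycles per instruction while the latency term stays fixed, so the utilization rises. Real hardware can hide part of this latency by overlapping instructions, so the model only shows the direction of the effect.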
This paper is organized as follows. The hardware
scalability of Mat-Core is discussed in detail in Section II.
Section III describes the performance scalability of linear
algebra kernels on the Mat-Core architecture with a variable
number of lanes. Finally, Section IV concludes this paper.
II. HARDWARE SCALABILITY OF MAT-CORE PROCESSOR
To reduce the execution time, most vector processors use
parallel pipelines per functional unit [10]. Thus, a vector unit
can be structured as parallel lanes, where each lane contains
a portion of the vector register file and one pipeline for each
vector functional unit. The concept of parallel lanes is
fundamental to vector microarchitecture, as it leads to
advantages in performance, design complexity, and
scalability.
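The lane organization described above can be sketched in a few lines. The example assumes round-robin interleaving, in which element i of a vector register resides in lane i mod L; the text does not state the mapping used, so this is an assumption, and the function names are illustrative.

```python
def partition_across_lanes(vector, lanes):
    """Split a vector register so that lane k holds elements k, k+L, k+2L, ...
    (assumed round-robin interleaving)."""
    return [vector[lane::lanes] for lane in range(lanes)]

def lane_parallel_add(x, y, lanes):
    """Element-wise add in which each lane processes only its own slice."""
    x_slices = partition_across_lanes(x, lanes)
    y_slices = partition_across_lanes(y, lanes)
    # Each inner list models the work done by one lane's adder pipeline.
    partial = [[a + b for a, b in zip(xs, ys)]
               for xs, ys in zip(x_slices, y_slices)]
    # Reassemble the per-lane results into program order.
    result = [0] * len(x)
    for lane, values in enumerate(partial):
        result[lane::lanes] = values
    return result

if __name__ == "__main__":
    print(partition_across_lanes(list(range(8)), 4))
    print(lane_parallel_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8, 4))
```

Because each lane touches only its own slice of the register file, adding lanes adds storage and execution pipelines together, with no communication between lanes for element-wise operations.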
There are several benefits to the modular, lane-based
implementation [11]. Only a single lane must be designed and
verified, regardless of the number of lanes allocated in the
processor. Scaling the processor to process longer
vectors or larger matrices by allocating the proper number of
lanes leads to a balanced addition of both register file and
execution resources, without requiring redesign of functional
22nd International Conference on Microelectronics (ICM 2010)
978-1-4244-5816-5/09/$26.00 ©2009 IEEE