Scalability Investigation of Mat-Core Processor

Mostafa I. Soliman
Computers & Systems Section, Electrical Engineering Dep., Aswan Faculty of Engineering, South Valley University, Aswan 81542, Egypt

Abdulmajid F. Al-Junaid
Computers & Systems Section, Electrical Engineering Dep., Faculty of Engineering, Assiut University, Assiut 71515, Egypt

Abstract—Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. This paper investigates the scalability of the Mat-Core architecture with different numbers of parallel lanes (one, four, and eight) on some linear algebra kernels: scalar-vector multiplication, SAXPY, Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication. A cycle-accurate model of the Mat-Core processor is implemented in SystemC (a system-level modeling language). Four versions of the Mat-Core processor are implemented and evaluated to show its scalability: a single lane with 8-element vector registers, four lanes with 4×4 matrix registers, four lanes with 8×4 matrix registers, and eight lanes with 8×8 matrix registers. The first version (single lane with 8-element vector registers) exploits only the scalar and vector ISA, whereas the other versions can exploit all three levels of the Mat-Core ISA (scalar/vector/matrix). Our results show that increasing the number of parallel lanes from one to four and then to eight speeds up the execution of the six kernels by factors of 3.6x-4.8x and 7.94x-10.6x, respectively, which indicates the scalability of the Mat-Core architecture.
Moreover, the maximum performance of the Mat-Core processor on matrix-matrix multiplication is 90% of the ideal value.

Index Terms— scalable architecture, high performance computing, performance evaluation, vector/matrix processing.

I. INTRODUCTION

Scalability is a major challenge for microprocessor designers. Architecture scalability means that a very large computer can be built from a large number of basic components (computers, processors or processing elements, memories, and switches) with no single bottleneck component. Thus, the computer can be expanded incrementally over its designed scaling range, delivering a linear increase in performance for a well-defined set of applications. This paper investigates the scalability of our proposed Mat-Core architecture with different numbers of parallel lanes (one, four, and eight) on some linear algebra kernels: scalar-vector multiplication, SAXPY (single-precision scalar A times vector X plus vector Y), Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication. A cycle-accurate model of the Mat-Core processor is implemented in SystemC (a system-level modeling language) [1].

Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the extended matrix unit is decoupled into two components, data computation and address generation, which communicate through data queues [2]. As in vector processors [3]-[7], the data computation unit is organized in parallel lanes. However, these parallel lanes can process not only vectors but also matrix data. Hence, the Mat-Core processor inherits from vector processor design a relatively straightforward means of scaling performance.
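The lane organization described above can be illustrated with a minimal Python sketch of SAXPY, one of the kernels studied. It is not the Mat-Core ISA or the SystemC model: the function names are hypothetical, and it assumes that vector elements are striped across lanes (element i goes to lane i mod L) and that every lane completes one element per lock-step pass.

```python
from math import ceil


def saxpy_reference(a, x, y):
    """Plain SAXPY: a * x[i] + y[i] for every element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_on_lanes(a, x, y, num_lanes):
    """SAXPY with elements striped across lanes.

    Assumption (not from the paper): element i is handled by lane
    i % num_lanes, and each lane finishes one element per pass.
    Returns the result vector and the idealized number of passes.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if num_lanes < 1:
        raise ValueError("num_lanes must be at least 1")

    n = len(x)
    result = [0.0] * n
    for lane in range(num_lanes):
        # Each lane walks over its own stripe of the vector.
        for i in range(lane, n, num_lanes):
            result[i] = a * x[i] + y[i]

    # All lanes advance together, so the pass count is ceil(n / lanes).
    passes = ceil(n / num_lanes)
    return result, passes


if __name__ == "__main__":
    x = [float(i) for i in range(1, 9)]  # one 8-element vector
    y = [10.0] * 8
    for lanes in (1, 4, 8):
        result, passes = saxpy_on_lanes(2.0, x, y, lanes)
        assert result == saxpy_reference(2.0, x, y)
        print(f"{lanes} lane(s): {passes} pass(es)")
```

For an 8-element vector the idealized pass count is 8, 2, and 1 for one, four, and eight lanes. Real cycle counts also depend on memory latency and functional-unit pipeline depth, which this sketch deliberately ignores.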
By increasing the number of parallel lanes, designers can easily increase the amount of data-level parallelism exploited. This also allows designers to scale the processor design to exploit the number of transistors per chip, which continues to grow according to Moore’s law [8].

Four versions of the Mat-Core processor are implemented and evaluated to show its scalability. These versions differ in the number of parallel lanes and in the size of the registers in the matrix unit. The first version contains one lane with 8-element vector registers. It exploits only the scalar and vector ISA, whereas the other versions can exploit all three levels of the Mat-Core ISA (scalar/vector/matrix) [9]. The second and third versions both contain four lanes but differ in the size of their matrix registers (4×4 and 8×4). These two versions show that, for the same number of parallel lanes, scaling the matrix register size improves the performance of Mat-Core, because a larger matrix register amortizes the pipeline latency of the functional units. The last version has eight lanes with 8×8 matrix registers.

This paper is organized as follows. Section II discusses the hardware scalability of Mat-Core in detail. Section III describes the performance scalability of the linear algebra kernels on the Mat-Core architecture with a variable number of lanes. Finally, Section IV concludes the paper.

II. HARDWARE SCALABILITY OF MAT-CORE PROCESSOR

To reduce the execution time, most vector processors use parallel pipelines per functional unit [10]. Thus, a vector unit can be structured as parallel lanes, where each lane contains a portion of the vector register file and one pipeline for each vector functional unit. The concept of parallel lanes is fundamental to the vector microarchitecture, as it leads to advantages in performance, design complexity, and scalability. There are several benefits to the modular, lane-based implementation [11].
22nd International Conference on Microelectronics (ICM 2010) 978-1-4244-5816-5/09/$26.00 ©2009 IEEE
Only a single lane must be designed and verified, regardless of the number of lanes allocated in the processor. Scaling the processor to process longer vectors or larger matrices by allocating the proper number of lanes leads to a balanced addition of both register file and execution resources, without requiring redesign of functional