Samarth Kaushik, Amit Kumar Singh, Thambipillai Srikanthan
Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore
{samarth2, amit0011, astsrikan}@ntu.edu.sg

Abstract: Design-time strategies are suited only to mapping a predefined set of applications and thus cannot account for dynamic behavior. This dynamism demands run-time mapping of application tasks to maintain a critical balance between performance and resource optimization. This paper proposes a run-time heuristic that distributes the application tasks among multiple processors, taking communication overhead, computation load and resource utilization into consideration.

Keywords: Multiprocessor System-on-Chip (MPSoC), Network-on-Chip (NoC), Mapping Algorithms.

I. INTRODUCTION

System-on-Chip (SoC) design is undergoing a radical shift from uni-processor to multi-processor architectures to keep pace with the ever-increasing demand for high performance. The rising complexity of real-life applications cannot be addressed simply by making single-core processors run faster; instead, it requires multiple processors, connected by a Network-on-Chip (NoC), that communicate cohesively and provide increased concurrency [1]. The challenge is to map the parallelized tasks of an application onto an MPSoC platform, which entails a judicious mechanism for mapping these tasks onto various processing elements (PEs), either at design time or at run time.

Numerous design-time mapping techniques have been developed, but they are limited to a predefined set of applications and are unaware of run-time resource management [2], whereas run-time mapping techniques can be applied to a large number of applications and incorporate run-time resource management. In [3], Holzenspies et al. propose a run-time strategy for mapping inherently parallel streaming applications on MPSoCs. Singh et al.
[4] describe a communication-aware run-time mapping heuristic for MPSoC platforms that accommodates multiple tasks on a single PE. The heuristic tries to minimize the communication overhead between two highly communicating tasks by mapping them on the same PE. However, existing heuristics do not attempt to balance the computation load on each PE used for mapping, and they take a restricted approach to minimizing communication overhead. We present a run-time task mapping technique that reduces computation load variance and delivers substantial performance improvements along with efficient resource utilization.

II. PROPOSED ALGORITHM

Our technique pre-processes the application graph before the actual mapping is done, in order to reduce the communication overhead and improve load balancing across the platform PEs, taking the memory available on the PEs into consideration.

Application Model. An application is modeled as a set of communicating parallel processes represented as a task graph. The task graph is denoted as a directed graph ATG = (T, E), where T is the set of application tasks and E is the set of all edges in the application, connecting the tasks and representing their communication, as shown in Figure 1. A task t_i ∈ T is represented as (t_id, t_comp), where t_id is the task identifier and t_comp is the task computation load in cycles. An edge e_i ∈ E connecting two tasks carries the communication information (t_comm) between them. t_comm is the number of cycles taken to transfer a single token when the full channel bandwidth is available.

Platform Model. The MPSoC architecture is a graph AG = (P, C), where P is the set of PEs, each identified by its identifier p_id, and C represents the on-chip communication channels interconnecting the PEs. The PEs are connected in a 4×4 mesh topology by a NoC.
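The application and platform models described so far can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names (`Task`, `TaskGraph`, `build_mesh`), the row-major numbering of p_id, the representation of each mesh channel as an unordered pair stored once, and all numeric values are hypothetical. Only the fields t_id, t_comp, t_comm and the 4×4 mesh come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    t_id: int    # task identifier
    t_comp: int  # computation load in cycles


@dataclass
class TaskGraph:
    """ATG = (T, E): tasks plus directed edges annotated with t_comm."""
    tasks: dict = field(default_factory=dict)  # t_id -> Task
    edges: dict = field(default_factory=dict)  # (src t_id, dst t_id) -> t_comm

    def add_task(self, t_id, t_comp):
        self.tasks[t_id] = Task(t_id, t_comp)

    def add_edge(self, src, dst, t_comm):
        # An edge may only connect tasks that are already in T.
        if src not in self.tasks or dst not in self.tasks:
            raise KeyError("both endpoints must be tasks in T")
        self.edges[(src, dst)] = t_comm


def build_mesh(rows=4, cols=4):
    """Return (P, C) for a rows x cols mesh.

    P is a list of p_id values (row-major numbering, an assumption).
    C is a set of (p, q) pairs with p < q, one per physical channel.
    """
    pes = [r * cols + c for r in range(rows) for c in range(cols)]
    channels = set()
    for r in range(rows):
        for c in range(cols):
            p = r * cols + c
            if c + 1 < cols:          # link to the east neighbour
                channels.add((p, p + 1))
            if r + 1 < rows:          # link to the south neighbour
                channels.add((p, p + cols))
    return pes, channels


# Hypothetical two-task application on the 4x4 platform.
atg = TaskGraph()
atg.add_task(0, 100)
atg.add_task(1, 250)
atg.add_edge(0, 1, 40)

pes, channels = build_mesh()
print(len(pes), len(channels))  # 16 24
```

Storing t_comm on the edge rather than on either task mirrors the text, where the communication cost is a property of the pair of tasks and not of one endpoint.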
Among the available PEs, one is used as the Manager Processor, which is responsible for managing task operations and resource usage, including run-time management of task loads.

Mapping. Task mapping is represented by the function mpg: t_i ∈ T → p_i ∈ P, which maps each task of the application onto a platform PE.

A. Pre-Processing

The technique tries to minimize communication latencies among the tasks of the application while simultaneously balancing the processing load across the PEs. The scheme starts by targeting the communication-intensive edges in the application and attempts to merge the highly communicating tasks on each such edge onto the same PE. The merging operation takes place only if the memory constraint of the PE involved is satisfied, i.e., the PE must have sufficient memory to accommodate both tasks and the shared memory for their local communication. The shared memory holds the communication data carried on the edge connecting the two tasks. The proposed strategy is a global approach, as the complete application graph is considered in its entirety when removing communication bottlenecks, in contrast to the mapping technique in [4], where merging of communicating tasks takes place during execution.

Preprocessing-based Run-time Mapping of Applications on NoC-based MPSoCs. 2011 IEEE Computer Society Annual Symposium on VLSI. 978-0-7695-4447-2/11 $26.00 © 2011 IEEE. DOI 10.1109/ISVLSI.2011.43

The main purpose of