Uniform Memory Architecture for GPU-CPU Resource Sharing in Edge AI Systems

Shivakumar Udkar, Senior Manager, Design Engineering, AMD Inc., Colorado, USA (shiv.udkar@amd.com)
Muthukumaran Vaithianathan, Senior Staff Engineer, Engineering, Samsung Semiconductor Inc., San Diego, USA (muthu.v@samsung.com)
Vikas Gupta, Senior Manager, System Design, AMD Inc., Texas, USA (vikas.gupta@amd.com)
Manjunath Reddy, Biometric Expert, Principal Engineer, Qualcomm Inc., San Diego, USA (reddym@qualcomm.com)

Abstract— Edge AI systems must share resources efficiently across CPUs and GPUs to meet the low-latency, high-throughput demands of real-time workloads. In traditional discrete memory systems, explicit data transfers between separate memory regions become the bottleneck, limiting performance and pushing that complexity onto the programmer. This paper introduces a new Uniform Memory Architecture (UMA) that enables edge AI systems to share resources across GPUs and CPUs seamlessly. The proposed UMA reduces data-transfer overhead, enhances processing performance, and simplifies the programming model by incorporating a unified address space, intelligent data placement, coherent caching, and improved memory allocation. By dynamically analyzing memory access patterns, the technique guarantees optimal data placement and consistency between processing units. Experimental evaluations across multiple edge AI workloads validate the approach, showing latency reductions of up to 45% and throughput improvements of up to 30%. This work lays a foundation for next-generation edge AI architectures, enabling AI-powered edge applications that are both responsive and efficient.

Keywords— Edge AI, Uniform Memory Architecture (UMA), CPU-GPU Resource Sharing, Cache Coherence

I. INTRODUCTION

Edge AI systems are critical enablers of real-time AI, especially in domains such as healthcare, autonomous vehicles, smart surveillance, and industrial automation [1]. These applications require smooth interaction between CPUs and GPUs to handle complex computations with low latency and high throughput. In traditional computer designs, the dedicated memory regions for central processing units (CPUs) and graphics processing units (GPUs) lead to processing delays, bottlenecks, and reduced efficiency because data must be transferred explicitly between them. The cost of memory management in such designs is a formidable obstacle for real-time edge AI systems, where even small delays impair performance and decision-making [2].

An encouraging approach to solving these problems is the UMA [3]. By allowing CPUs and GPUs to share memory, a UMA reduces data duplication and improves memory access. However, most current implementations lack the optimizations that edge AI systems need: such systems must carefully balance power efficiency, limited hardware resources, and real-time processing demands. Beyond unifying the memory space, a well-designed UMA for edge AI should incorporate intelligent data-management techniques for dynamic resource allocation, minimization of cache-coherence overhead, and processing-performance enhancement [4].

Since CPUs and GPUs run different execution models and handle different types of workloads, the main challenge in CPU-GPU resource sharing is efficient management of memory access patterns [5]. Whereas GPUs excel at batch calculations and parallel processing, CPUs shine at sequential and control-driven operations.
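The access-pattern-driven placement described above can be illustrated with a small conceptual sketch. This is not the paper's implementation; it is a hypothetical Python simulation in which a tracker counts per-page accesses from each device and migrates a page toward whichever device dominates its recent accesses. All names (`PagePlacementTracker`, `migrate_threshold`) are illustrative assumptions.

```python
# Hypothetical sketch: access-pattern-driven page placement in a
# unified CPU-GPU address space. All names are illustrative.
from collections import defaultdict

class PagePlacementTracker:
    """Tracks per-page access counts from each device and recommends
    where a page should physically reside (CPU DRAM vs. GPU memory)."""

    def __init__(self, migrate_threshold=0.75):
        # Fraction of observed accesses a device must own before the
        # page is migrated toward that device's local memory.
        self.migrate_threshold = migrate_threshold
        self.counts = defaultdict(lambda: {"cpu": 0, "gpu": 0})
        self.placement = {}  # page -> "cpu" or "gpu"

    def record_access(self, page, device):
        assert device in ("cpu", "gpu")
        self.counts[page][device] += 1

    def recommend(self, page):
        c = self.counts[page]
        total = c["cpu"] + c["gpu"]
        if total == 0:
            # Untouched pages stay at their default home (CPU DRAM).
            return self.placement.get(page, "cpu")
        for device in ("cpu", "gpu"):
            if c[device] / total >= self.migrate_threshold:
                self.placement[page] = device
        return self.placement.get(page, "cpu")

tracker = PagePlacementTracker()
for _ in range(9):
    tracker.record_access(0x1000, "gpu")   # page is hot on the GPU
tracker.record_access(0x1000, "cpu")       # occasional CPU touch
print(tracker.recommend(0x1000))  # -> gpu
```

In a real UMA the counters would live in hardware or the driver and migration would move physical pages, but the policy decision (migrate toward the dominant accessor, with a threshold to avoid thrashing) is the same shape.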
Optimal placement of frequently accessed data, fewer redundant memory fetches, and improved computational performance are the outcomes of a memory-management strategy that accounts for this diversity of execution models. To ensure that data resides in the location best suited to its computational intensity and reuse, a good UMA needs smart algorithms for data movement and placement [6].

Cache coherence is the other significant consideration for efficient resource sharing in edge AI systems. In traditional designs, keeping CPU and GPU caches consistent typically incurs additional synchronization cost, which hurts real-time performance. An essential component of any UMA is therefore a cache-coherence technique that guarantees data integrity while reducing synchronization latency [7]. An efficient cache-management scheme prevents CPUs and GPUs from waiting on out-of-date data and avoids replicating data [8].

A well-structured UMA also simplifies the development of edge AI applications while improving performance. In traditional programming models, developers manually manage memory allocation and data transfers between CPUs and GPUs, which is laborious and inefficient [9]. A shared memory space removes this burden, giving programmers the leeway to focus on improving AI algorithms rather than the details of memory management.
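The coherence guarantee described above (a reader never observes stale data, and no duplicate dirty copies exist) can be sketched as a minimal MSI-style state machine for a single line shared by a CPU cache and a GPU cache. This is an illustrative simulation under assumed names, not the paper's protocol: a write invalidates the remote copy before taking exclusive ownership, and a read of a line held Modified elsewhere forces a writeback first.

```python
# Hypothetical sketch: MSI-style coherence for one memory line shared
# by a CPU cache and a GPU cache. All names are illustrative.

MODIFIED, SHARED, INVALID = "M", "S", "I"

class CoherentLine:
    """One coherent line: reads trigger writebacks of dirty remote
    copies; writes invalidate the remote copy before proceeding."""

    def __init__(self):
        self.state = {"cpu": INVALID, "gpu": INVALID}
        self.value = 0  # backing-memory copy of the line

    @staticmethod
    def _other(device):
        return "gpu" if device == "cpu" else "cpu"

    def read(self, device):
        other = self._other(device)
        if self.state[other] == MODIFIED:
            # Snoop hit on a dirty remote copy: force a writeback so
            # the reader never observes stale data, then share it.
            self.state[other] = SHARED
        if self.state[device] == INVALID:
            self.state[device] = SHARED
        return self.value

    def write(self, device, value):
        # Invalidate the remote copy, then take exclusive ownership.
        self.state[self._other(device)] = INVALID
        self.state[device] = MODIFIED
        self.value = value

line = CoherentLine()
line.write("gpu", 42)    # GPU produces a result in shared memory
print(line.read("cpu"))  # -> 42 (CPU never sees a stale value)
print(line.state)        # -> {'cpu': 'S', 'gpu': 'S'}
```

Hardware coherence protocols track many lines, more states (e.g. MESI's Exclusive), and directory or snoop traffic, but the invariant this sketch enforces, at most one Modified copy and forced writeback before a remote read, is the property the synchronization cost in traditional designs pays for.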