I/O Architecture, Substrate Design, and Bonding Process for a Heterogeneous Dielet-Assembly based Waferscale Processor

Saptadeep Pal∗, Irina Alam∗, Krutikesh Sahoo∗, Haris Suhail∗, Rakesh Kumar†, Sudhakar Pamarti∗, Puneet Gupta∗ and Subramanian S. Iyer∗
∗Electrical and Computer Engineering, University of California, Los Angeles, CA 90095, USA
†Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL 61820, USA
Email: saptadeep@ucla.edu

Abstract—Demand for large amounts of parallelism is growing rapidly in today’s computing systems, driven by the proliferation of applications such as graph processing, data analytics, and machine learning, which require a large number of processing cores and a large amount of memory bandwidth. These applications are often run on systems comprising many individually packaged chips. However, inter-package communication has not scaled well, and this bottleneck threatens the performance scaling of these applications. One way to alleviate this bottleneck is to build waferscale processors in which many compute cores and memory blocks can communicate efficiently at very high bandwidths. In this work, we attempt to build a many-core waferscale processor using heterogeneous dielet assembly on the Silicon Interconnect Fabric (Si-IF) technology. The design and implementation of a dielet-based waferscale processor pose their own set of challenges, including (1) design of area- and energy-efficient, highly parallel I/O cells, (2) Si-IF substrate design and its impact on signaling and power delivery, and (3) a reliable and efficient dielet-to-wafer bonding process. We discuss the solutions to these three challenges that we employed in our dielet and Si-IF substrate design. Our custom-designed I/O cell occupies only 157.8 μm², which is 95% smaller than the standard cell I/Os, and consumes only about 0.075 pJ/bit.
We co-designed the dielets with the Si-IF substrate to ensure that inter-dielet communication achieves on-chip-like characteristics. This lets us seamlessly partition a large design into fine-grained dielets. To deliver power to the dielets across an entire wafer, we supply power from the edge of the wafer. This scheme results in large resistive power loss, so we designed a novel power management unit on each tile to provide reliable power to the core circuitry in the dielets. Lastly, we briefly discuss the copper-gold bonding process and the heterogeneous dielet assembly scheme we developed for an efficient and reliable assembly process. Shear tests show that the bond strength achieved with this process is ∼113.3 MPa, which is more than 5× the previously reported bond strength for gold-gold bonding.

Keywords—Waferscale Processors, Silicon Interconnect Fabric, Dielet Assembly

I. INTRODUCTION

In recent times, there has been a rapid proliferation of highly parallel workloads such as graph processing, data analytics, and machine learning that are driving the need for a large number of processing cores, large memory capacity, and high bandwidth in today’s high performance computing systems [1], [2]. These applications are often run on systems comprising multiple discrete packaged processors connected using conventional off-package communication links through PCBs and between PCBs. The inter-package communication links are one of the major bottlenecks in today’s systems: their energy efficiency and bandwidth are much poorer than those of on-die links, which limits the performance scaling of these applications [3]. This is because, although Moore’s Law has helped shrink on-chip features by >1000× over the last four decades, off-chip package components have scaled by merely about 4× [4].
To push higher bandwidth between packaged components where interconnect wiring is sparse, the data rate per wire/link needs to be increased. This is done using high-speed I/O circuitry in combination with serialization and de-serialization (SerDes) schemes [5]. The SerDes circuitry converts the low-frequency parallel data interfaces inside the dies to the high-speed serialized interfaces required for high-bandwidth communication between the packages. Such a SerDes-based scheme comes with its own challenges. First, the area taken up by the complex I/O circuitry that supports chip-to-chip communication is often large, already exceeding 25% on some of today’s processors, and the power overhead of such I/O can often exceed 30% [4]. Moreover, large communication latency is incurred, which often results in significant bottlenecks to multi-chip performance scaling. Despite these large power, area, and latency costs, off-chip bandwidth still lags on-chip bandwidth by up to 50×.

Recent advances in multi-chip module (MCM) [6], [7] and interposer technology [8] have targeted this mismatch. These technologies can integrate multiple processor and memory dies tightly inside a package by inserting a new level of inter-dielet interconnection that provides high bandwidth and low latency. Examples of such technologies include TSMC CoWoS [9] and Intel’s EMIB [10]. Though these technologies alleviate some of the issues of conventional single-die packages, they are still constrained by the size limit and can accommodate only a few dies within one package. A scale-out high performance system today therefore needs to integrate many multi-die packages on a PCB or across multiple PCBs to satisfy the compute and memory needs of modern workloads.
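As a rough illustration of the SerDes idea described earlier in this section, the following Python sketch models only the data-path behavior: a wide, slow parallel word is flattened into a single fast bit stream and later reassembled. The function names, the LSB-first bit order, and the example widths and clock rates are assumptions made for illustration and do not come from this paper. A real SerDes also needs line coding, clock recovery, and equalization, all of which are omitted here.

```python
from typing import List


def serialize(words: List[int], width: int) -> List[int]:
    """Flatten parallel words of `width` bits into one bit stream (LSB first)."""
    bits: List[int] = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError(f"word {word} does not fit in {width} bits")
        for i in range(width):
            bits.append((word >> i) & 1)
    return bits


def deserialize(bits: List[int], width: int) -> List[int]:
    """Rebuild parallel words of `width` bits from an LSB-first bit stream."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words: List[int] = []
    for start in range(0, len(bits), width):
        word = 0
        for i, bit in enumerate(bits[start:start + width]):
            word |= bit << i
        words.append(word)
    return words


def serial_rate_gbps(parallel_clock_ghz: float, width: int) -> float:
    """Line rate one serial link needs to carry a `width`-bit bus without loss."""
    return parallel_clock_ghz * width


if __name__ == "__main__":
    data = [0xA5, 0x3C]                      # two hypothetical 8-bit words
    stream = serialize(data, width=8)
    assert deserialize(stream, width=8) == data
    # A hypothetical 16-bit bus at 0.5 GHz needs one 8 Gb/s serial link.
    print(serial_rate_gbps(0.5, 16))
```

The last helper shows why the serial side must run much faster than the internal bus: the line rate grows linearly with the width of the parallel interface being folded onto one wire.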
There again, the off-package intra-PCB and
2021 IEEE 71st Electronic Components and Technology Conference (ECTC) 2377-5726/21/$31.00 ©2021 IEEE DOI 10.1109/ECTC32696.2021.00057