I/O Architecture, Substrate Design, and Bonding
Process for a Heterogeneous Dielet-Assembly based
Waferscale Processor
Saptadeep Pal∗, Irina Alam∗, Krutikesh Sahoo∗, Haris Suhail∗, Rakesh Kumar†, Sudhakar Pamarti∗, Puneet Gupta∗ and Subramanian S. Iyer∗
∗Electrical and Computer Engineering, University of California, Los Angeles, CA 90095, USA
†Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL 61820, USA
Email: saptadeep@ucla.edu
Abstract—Demand for large amounts of parallelism is growing
rapidly for today’s computing systems. This is due to the prolif-
eration of applications such as graph processing, data analytics,
machine learning, etc., which require a large number of processing
cores and a large amount of memory bandwidth. Often, systems
comprising many individual packaged chips are employed to
run these applications. However, inter-package communication
has not scaled well and this bottleneck threatens the performance
scaling of these applications. One way to alleviate this bottleneck
is to build waferscale processors where many compute cores
and memory blocks can communicate efficiently at very high
bandwidths. In this work, we attempt to build a many-core
waferscale processor using heterogeneous dielet assembly on the
Silicon Interconnect Fabric (Si-IF) technology. The design and
implementation of a dielet based waferscale processor have their
own set of challenges. Some of the challenges include (1) design
of area and energy efficient highly parallel I/O cells, (2) Si-IF
substrate design and its impact on signaling and power delivery,
and (3) reliable and efficient dielet-to-wafer bonding process. In
this work, we will discuss the solutions to these three challenges
that we employed in our dielet and Si-IF substrate design. Our
custom-designed I/O cell occupies only 157.8 μm², which is 95% smaller
than the standard-cell I/Os, and consumes only about 0.075 pJ/bit.
We co-designed the dielets with the Si-IF substrate to ensure that
we can achieve on-chip like communication characteristics for
inter-dielet communication. This helps us to seamlessly partition
a large design into fine-grained dielets. For delivering power to
the dielets across an entire wafer, we use power delivery from the
edge of the wafer. This scheme incurs large resistive power
loss, so we designed a novel power management unit
on each tile to provide reliable power to the core circuitry in the
dielets. Lastly, we briefly discuss the copper-gold bonding process
and the heterogeneous dielet assembly scheme we developed for
an efficient and reliable assembly process. Shear tests show that
the bond strength achieved with this process is ∼113.3 MPa, which
is >5× the previously reported bond strength for
gold-gold bonding.
Keywords—Waferscale Processors, Silicon Interconnect Fabric,
Dielet Assembly
I. INTRODUCTION
In recent times, there has been a rapid proliferation of highly
parallel workloads such as graph processing, data analytics,
machine-learning, etc. that are driving the need for a large
number of processing cores, large memory capacity and high
bandwidth in today’s high performance computing systems [1],
[2]. These applications are often run on systems comprising
multiple discrete packaged processors connected by
conventional off-package communication links within and
between PCBs. These inter-package communication links
are one of the major bottlenecks in today’s systems: their
energy efficiency and bandwidth are much poorer than
those of on-die links, which limits the performance scaling
of these applications [3]. This is because, although Moore’s Law
has helped shrink on-chip features by >1000× over the
last four decades, off-chip package components have scaled
by merely about 4× [4]. To push higher bandwidth between
packaged components where interconnect wiring is sparse, data
rate per wire/link needs to be increased. This is done using
high-speed I/O circuitry in combination with serialization and
de-serialization (SerDes) schemes [5]. The SerDes circuitry is
used to convert low frequency parallel data interfaces inside
the dies to high-speed serialized interfaces required for high
bandwidth communication between the packages.
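The parallel-to-serial conversion described above can be illustrated with a minimal behavioral sketch. This is not the circuitry of any particular SerDes; the function names, the LSB-first bit ordering, and the example bus width and clock are assumptions chosen only for illustration.

```python
from typing import List


def serialize(words: List[int], width: int) -> List[int]:
    """Flatten parallel words into one serial bit stream (LSB first, by assumption)."""
    bits: List[int] = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError(f"word {word} does not fit in {width} bits")
        bits.extend((word >> i) & 1 for i in range(width))
    return bits


def deserialize(bits: List[int], width: int) -> List[int]:
    """Rebuild parallel words from a serial bit stream produced by serialize()."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words: List[int] = []
    for start in range(0, len(bits), width):
        word = 0
        for i, bit in enumerate(bits[start:start + width]):
            word |= bit << i
        words.append(word)
    return words


def serial_line_rate_hz(width: int, parallel_clock_hz: int) -> int:
    """Bit rate one serial link needs to carry a width-bit bus clocked at parallel_clock_hz."""
    return width * parallel_clock_hz


# Hypothetical example: a 16-bit on-die bus at 500 MHz mapped onto a single link.
rate = serial_line_rate_hz(16, 500_000_000)
```

The sketch shows why the per-wire data rate must rise as the number of wires falls: carrying a hypothetical 16-bit, 500 MHz internal bus over one link requires that link to run at 16 times the internal clock.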
Such a SerDes-based scheme comes with its own challenges.
First, the area taken up by the complex I/O circuitry that supports
chip-to-chip communication is often large, already exceeding
25% on some of today’s processors, and the power overhead of such
I/O can exceed 30% [4]. Moreover, the large communication
latency incurred often results in significant bottlenecks
to multi-chip performance scaling. Despite these large
power, area, and latency costs, off-chip bandwidth still
lags on-chip bandwidth by up to 50×. Recent advances in
multi-chip module (MCM) [6], [7] and interposer technology [8] have
targeted this mismatch. These technologies can integrate
multiple processor and memory dies tightly inside a package
by inserting a new level of inter-dielet interconnection that
provides high bandwidth and low latency. Examples of such
technologies include TSMC’s CoWoS [9] and Intel’s EMIB [10].
Though these technologies alleviate some of the issues of
conventional single-die packages, they are still constrained by
the size limit and can accommodate only a few dies within one
package. A scale-out high performance system today therefore
needs to integrate many multi-die packages on a PCB or across
multiple PCBs to satisfy the compute and memory needs of
modern workloads. There again, the off-package intra-PCB and
2021 IEEE 71st Electronic Components and Technology Conference (ECTC)
2377-5726/21/$31.00 ©2021 IEEE
DOI 10.1109/ECTC32696.2021.00057