Corresponding author: Oluwatosin Oladayo ARAMIDE.
Copyright © 2024 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
Ultra Ethernet vs. InfiniBand for AI/ML Clusters: A comparative study of
performance, cost and ecosystem viability
Oluwatosin Oladayo ARAMIDE *
Network and Storage Layer, NetApp Ireland Limited, Ireland.
Open Access Research Journal of Science and Technology, 2024, 12(02), 169-179
Publication history: Received on 09 November 2024; revised on 24 November 2024; accepted on 29 November 2024
Article DOI: https://doi.org/10.53022/oarjst.2024.12.2.0149
Abstract
Artificial intelligence (AI) and machine learning (ML) workloads are becoming increasingly complex and data-intensive, which has heightened the importance of high-performance interconnects in training and inference clusters. Two competing technologies capable of delivering the low-latency, high-bandwidth communication that distributed AI systems require have emerged: InfiniBand and the newer Ultra Ethernet. This paper presents a detailed comparison along the dimensions that matter most to AI/ML infrastructure teams evaluating these technologies: latency, throughput, scalability, cost-effectiveness, and maturity of vendor support.
InfiniBand is a well-established, proven interconnect for high-performance computing (HPC) that offers lossless transport, remote direct memory access (RDMA), and compatibility with common AI frameworks. Ultra Ethernet, by contrast, extends the Ethernet stack with modifications tailored to AI workloads, such as improved congestion control and load balancing. Using simulated benchmarks and practical cluster setups, we analyze the performance profiles of the two technologies across typical AI/ML workloads, including large language model training and distributed inference.
Our comparison reveals clear trade-offs: InfiniBand delivers the best latency in tightly coupled training platforms, whereas Ultra Ethernet offers advantages in scalability, ecosystem openness, and cost-performance ratio, especially in cloud-scale workflows. The paper also examines total cost of ownership (TCO), compatibility with software tooling, and exposure to single-vendor lock-in. By placing these findings in the context of recent deployment trends and future technology roadmaps, this study provides a pragmatic set of guidelines for AI infrastructure architects selecting networking for high-throughput, large-scale ML deployments.
Keywords: Ultra Ethernet; InfiniBand; AI/ML Clusters; High-Performance Interconnects; Cost Analysis; HPC
Networking
1. Introduction
The expansion of artificial intelligence (AI) and machine learning (ML) workloads has transformed the performance requirements placed on computing infrastructure. Modern AI/ML and HPC clusters must process petabytes of data, train foundation models containing billions of parameters, and serve real-time inference at the edge and in the cloud. These workloads demand not only high computational throughput but also ultra-low-latency, high-bandwidth networking that makes efficient communication between distributed nodes possible. Consequently, the interconnect fabric has emerged as a serious bottleneck and a distinguishing factor in the design and performance of scale-out AI/ML systems (Ramachandran et al., 2021; Bonati, 2022).
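To illustrate why the interconnect dominates at this scale, a back-of-envelope calculation helps. The sketch below estimates the per-node communication time of a single gradient synchronization using the standard bandwidth-optimal ring all-reduce formula, which moves roughly 2(n-1)/n times the model size per node. The model size, node count, and link speed are illustrative assumptions, not measurements from this study.

```python
def ring_allreduce_time_s(model_bytes: float, nodes: int, link_gbps: float) -> float:
    """Estimate the bandwidth-bound time for one ring all-reduce.

    model_bytes: size of the gradient buffer to synchronize, in bytes
    nodes:       number of participating nodes
    link_gbps:   per-node link speed in gigabits per second

    Ring all-reduce sends/receives ~2*(n-1)/n * model_bytes per node;
    this ignores per-hop latency, so it is a lower bound.
    """
    traffic_bytes = 2 * (nodes - 1) / nodes * model_bytes
    bandwidth_bytes_per_s = link_gbps * 1e9 / 8
    return traffic_bytes / bandwidth_bytes_per_s

# Hypothetical example: a 7B-parameter model in fp16 (~14 GB of gradients)
# synchronized across 8 nodes over 400 Gb/s links.
t = ring_allreduce_time_s(14e9, nodes=8, link_gbps=400)
print(f"per-step all-reduce lower bound: {t:.2f} s")  # ~0.49 s
```

Even under these optimistic assumptions, each training step spends on the order of half a second moving gradients, which is why fabric bandwidth and congestion behavior directly gate cluster-scale training throughput.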