Ultra Ethernet vs. InfiniBand for AI/ML Clusters: A comparative study of performance, cost and ecosystem viability

Oluwatosin Oladayo ARAMIDE *

Network and storage layer, Netapp Ireland Limited, Ireland.

Open Access Research Journal of Science and Technology, 2024, 12(02), 169-179

Publication history: Received on 09 November 2024; revised on 24 November 2024; accepted on 29 November 2024

Article DOI: https://doi.org/10.53022/oarjst.2024.12.2.0149

* Corresponding author: Oluwatosin Oladayo ARAMIDE. Copyright © 2024 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.

Abstract

Artificial intelligence (AI) and machine learning (ML) workloads are becoming increasingly complex and data-intensive, raising the importance of high-performance interconnects in training and inference clusters. Two competing technologies capable of delivering the low-latency, high-bandwidth communication that distributed AI systems require stand out: InfiniBand and the newer Ultra Ethernet. This paper compares the two along the dimensions that matter most to AI/ML infrastructure: latency, throughput, scalability, cost-effectiveness, and the maturity of vendor support. InfiniBand is a well-established, proven interconnect in high-performance computing (HPC), offering lossless transport, remote direct memory access (RDMA), and compatibility with common AI frameworks. Ultra Ethernet, by contrast, extends the Ethernet stack with enhancements aimed at AI workloads, such as improved congestion control and load balancing. Using simulated benchmarks and practical cluster setups, we analyze the performance profiles of the two technologies across typical AI/ML workloads, including large language model training and distributed inference. Our comparison reveals clear trade-offs: InfiniBand delivers the best latency in tightly coupled training platforms, whereas Ultra Ethernet offers advantages in scalability, ecosystem openness, and cost-performance ratio, especially in cloud-scale deployments. The paper also examines total cost of ownership (TCO), compatibility with software tooling, and exposure to single-vendor lock-in. By placing these findings in the context of recent deployment trends and technology roadmaps, this study offers pragmatic guidelines for AI infrastructure architects selecting networking for high-throughput, large-scale ML deployments.

Keywords: Ultra Ethernet; InfiniBand; AI/ML Clusters; High-Performance Interconnects; Cost Analysis; HPC Networking

1. Introduction

The expanding workloads of artificial intelligence (AI) and machine learning (ML) have reshaped the performance requirements of computing infrastructure. Current AI/ML HPC clusters must handle petabytes of data, train foundation models containing billions of parameters, and serve real-time inference at the edge and in the cloud. These workloads demand not only high computational throughput but also ultra-low-latency, high-bandwidth networking that enables efficient communication between distributed nodes.
Consequently, the interconnect fabric has emerged as a serious bottleneck and a distinguishing factor in the design and performance of scale-out AI/ML systems (Ramachandran et al., 2021; Bonati, 2022).
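As a rough illustration of why fabric latency and bandwidth dominate synchronization cost at scale, the sketch below estimates the per-step cost of a ring all-reduce under the standard alpha-beta cost model. All parameter values (message size, node count, link latency, and bandwidth) are hypothetical assumptions chosen for illustration, not measurements from this study.

```python
# Illustrative (hypothetical) estimate of ring all-reduce communication time
# under the standard alpha-beta cost model; parameter values are assumptions,
# not benchmark results from this paper.

def ring_allreduce_time(message_bytes, num_nodes, link_latency_s, link_bandwidth_bytes_per_s):
    """Return estimated all-reduce time (seconds) for a ring algorithm.

    Cost model: 2*(p-1) communication steps, each carrying message_bytes/p,
    so time ~= 2*(p-1)*alpha + (2*(p-1)/p) * N / B.
    """
    p = num_nodes
    latency_term = 2 * (p - 1) * link_latency_s
    bandwidth_term = (2 * (p - 1) / p) * message_bytes / link_bandwidth_bytes_per_s
    return latency_term + bandwidth_term


if __name__ == "__main__":
    gradients = 10 * 2**30  # 10 GiB of gradients exchanged per synchronization step
    nodes = 64
    # Hypothetical fabric profiles (order-of-magnitude figures, not vendor data):
    fabrics = {
        "low-latency RDMA fabric": (1e-6, 400e9 / 8),    # ~1 us, 400 Gb/s links
        "commodity Ethernet fabric": (10e-6, 100e9 / 8), # ~10 us, 100 Gb/s links
    }
    for name, (alpha, bandwidth) in fabrics.items():
        t = ring_allreduce_time(gradients, nodes, alpha, bandwidth)
        print(f"{name}: ~{t * 1e3:.1f} ms per all-reduce")
```

Even in this simplified model, the bandwidth term grows with gradient volume while the latency term grows with node count, which is why both properties of the fabric become first-order design concerns as clusters scale out.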