This week the Linux Foundation announced that it will oversee the formation of a new Ethernet consortium focused on adapting and refining the technology for high-performance computing workloads. Backed by founding members AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft, the new Ultra Ethernet Consortium (UEC) will be working to improve Ethernet to meet the low latency and scalability requirements of HPC and AI systems, requirements that the group says current Ethernet technology isn't quite able to meet.
The top priority of the new group will be to define and develop what they are calling the Ultra Ethernet Transport (UET) protocol, a new transport-layer protocol for Ethernet that will better address the needs of AI and HPC workloads.
Ethernet is certainly one of the most ubiquitous technologies around, but the demands of AI and HPC clusters are growing so fast that the technology is expected to run out of steam in the future. The size of large AI models is increasing rapidly. GPT-3 was trained with 175 billion parameters back in 2020. Today, GPT-4 is said to already have a trillion parameters. Models with more parameters require larger clusters, and those clusters in turn send larger messages over the network. As a result, the higher the bandwidth and the lower the latency of the network, the more efficiently the cluster can operate.
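To give a rough sense of scale, the back-of-the-envelope sketch below estimates how long it would take to move one full copy of a model's parameters over a single link. The 2 bytes per parameter, the link speeds, and the function names are illustrative assumptions rather than figures from the consortium, and real training jobs split traffic across many devices and links, so this is only an upper-level intuition for why bandwidth matters.

```python
# Illustrative assumptions (not from the announcement):
# - each parameter is stored as a 16-bit value (2 bytes)
# - links are idealized: no protocol overhead, no congestion
BYTES_PER_PARAM = 2


def payload_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Size in bytes of one full copy of the model's parameters."""
    return num_params * bytes_per_param


def transfer_seconds(num_bytes, link_gbps):
    """Time to push num_bytes over one link of link_gbps gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


models = {
    "175 billion parameters": 175_000_000_000,
    "1 trillion parameters": 1_000_000_000_000,
}

for name, params in models.items():
    size = payload_bytes(params)
    for gbps in (100, 400, 800):  # hypothetical link speeds
        secs = transfer_seconds(size, gbps)
        print(f"{name}: {size / 1e9:.0f} GB, {secs:.1f} s at {gbps} Gb/s")
```

Under these assumptions, a 175-billion-parameter model is 350 GB of data, which takes about 7 seconds to cross a single idealized 400 Gb/s link.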
"Many HPC and AI users are finding it difficult to obtain the full performance from their systems due to weaknesses in the system interconnect capabilities," said Dr. Earl Joseph, CEO of Hyperion Research.
At a high level, the new Ultra Ethernet Consortium is looking to refine Ethernet in a surgical manner, improving and altering only those bits and pieces necessary to achieve its goals. At the outset, the consortium is looking at improving both the software and physical layers of Ethernet technology, but without altering its basic structure, so as to preserve cost efficiency and interoperability.
Technical goals of the consortium include developing specifications, APIs, and source code to define protocols, interfaces, and data structures for Ultra Ethernet communications. In addition, the consortium aims to update existing link and transport protocols and create new telemetry, signaling, security, and congestion mechanisms to better address the needs of large AI and HPC clusters. Meanwhile, since AI and HPC workloads differ in a number of ways, UET will have separate profiles for each type of deployment.
"Generative AI workloads will require us to architect our networks for supercomputing scale and performance," said Justin Hotard, executive vice president and general manager, HPC & AI, at Hewlett Packard Enterprise. "The importance of the Ultra Ethernet Consortium is to develop an open, scalable, and cost-effective ethernet-based communication stack that can support these high-performance workloads to run efficiently. The ubiquity and interoperability of ethernet will provide customers with choice, and the performance to handle a variety of data intensive workloads, including simulations, and the training and tuning of AI models."
The Ultra Ethernet Consortium is hosted by the Linux Foundation, though the real work will be undertaken by its members. AMD, Cisco, Intel, and the other founders all either design high-performance CPUs, compute GPUs, and network infrastructure for AI and HPC workloads, or build supercomputers and clusters for AI and HPC applications, and thus have plenty of experience with the relevant technologies. The UEC's work is set to be conducted by four working groups, covering the Physical Layer, Link Layer, Transport Layer, and Software Layer.
And while the group is not explicitly talking about Ultra Ethernet in relation to any competing technologies, the makeup of the founding board, or rather who is not on it, is telling. The performance goals and HPC focus of Ultra Ethernet would bring it into direct competition with InfiniBand, which has for over a decade been the networking technology of choice for low-latency, HPC-style networks. While InfiniBand is developed by its own trade association, NVIDIA is said to have an outsized influence on that group by way of its Mellanox acquisition a few years ago, and the company is noticeably the odd man out of the new consortium. NVIDIA makes significant use of both Ethernet and InfiniBand internally, using both for its scalable DGX SuperPod systems.
As for the proposed Ultra Ethernet standards, UEC members are already planning how to integrate the upcoming UET technology into their products.
"We are particularly encouraged by the improved transport layer of UEC and believe our portfolio is primed to take advantage of it," said Mark Papermaster, CTO of AMD in a blog post. "UEC allows for packet-spraying delivery across multiple paths without causing congestion or head-of-line blocking, which will enable our processors to successfully share data across clusters with minimal incast issues or the need for centralized load-balancing. Lastly, UEC accommodates built-in security for AI and HPC workloads that in turn help AMD capitalize on our robust security and encryption capabilities."
For now, the UEC has not said when it expects to finalize the UET specification. It's expected that the group will seek certification from the IEEE, which maintains the various Ethernet standards, so there is an additional set of hoops to jump through there.
Finally, the UEC has noted that it is looking for additional members to round out the group, and will begin accepting new member applications in Q4 2023. Along with NVIDIA, there are several other tech giants involved in AI or HPC work that are not part of the group, so that will be their next chance to join the consortium.
Source: The Linux Foundation, The Register