NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI


On Oct. 28, 2024, NVIDIA announced that xAI’s Colossus supercomputer cluster in Memphis, Tennessee, has reached a massive scale of 100,000 NVIDIA® Hopper GPUs. The cluster uses the NVIDIA Spectrum-X™ Ethernet networking platform for its RDMA (Remote Direct Memory Access) network. The platform is designed to deliver exceptional performance for multi-tenant, hyperscale AI factories.

Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

xAI and NVIDIA built the supporting facility and the state-of-the-art supercomputer in just 122 days; systems of this scale typically take many months or even years. Only 19 days passed between the first rack arriving on the floor and the start of training.

While training the very large Grok models, Colossus has achieved unprecedented network performance. Across all three tiers of the network fabric, the system has experienced no application latency degradation or packet loss due to flow collisions. With Spectrum-X congestion control, it has sustained 95% data throughput.

This level of performance is simply unachievable at scale with traditional Ethernet, which can only deliver 60% of data throughput when thousands of flows collide.
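To put the two throughput figures side by side, the arithmetic can be sketched as below. The 95% and 60% values come from the article; the 400 Gb/s link rate and the names `effective_throughput_gbps` and `LINK_RATE_GBPS` are illustrative assumptions, not figures reported for Colossus.

```python
def effective_throughput_gbps(link_rate_gbps: float, utilization: float) -> float:
    """Return the usable bandwidth of a link at a given utilization fraction."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return link_rate_gbps * utilization


# Assumption for illustration only: a nominal 400 Gb/s link.
LINK_RATE_GBPS = 400.0

# Throughput fractions quoted in the article.
SPECTRUM_X_UTILIZATION = 0.95
TRADITIONAL_ETHERNET_UTILIZATION = 0.60

spectrum_x = effective_throughput_gbps(LINK_RATE_GBPS, SPECTRUM_X_UTILIZATION)
traditional = effective_throughput_gbps(LINK_RATE_GBPS, TRADITIONAL_ETHERNET_UTILIZATION)

# The ratio does not depend on the assumed link rate.
ratio = spectrum_x / traditional

print(f"Spectrum-X:           {spectrum_x:.0f} Gb/s usable")
print(f"Traditional Ethernet: {traditional:.0f} Gb/s usable")
print(f"Ratio:                {ratio:.2f}x")
```

Whatever link rate is assumed, the quoted figures imply roughly 1.58 times as much usable bandwidth (95 / 60) on the same hardware.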

“AI is becoming increasingly critical, placing greater demands on performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is purpose-built to enable innovators like xAI to process, analyze and execute AI workloads faster, accelerating the development, deployment and time to market of AI solutions.”

Elon Musk said on X: “Colossus is the most powerful training system in the world. Well done to the xAI team, NVIDIA, and our many partners and suppliers.”

“xAI has built the world’s largest and most powerful supercomputer,” said an xAI spokesperson. “With NVIDIA Hopper GPUs and Spectrum-X, we are able to push the boundaries of large-scale AI model training and build a super-accelerated, optimized AI factory based on the Ethernet standard.”

At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds up to 800Gb/s and is powered by the Spectrum-4 switch ASIC. xAI uses an end-to-end solution that combines the Spectrum-X SN5600 switch with the NVIDIA BlueField-3® SuperNIC to achieve unprecedented performance.

Spectrum-X Ethernet networking for AI provides advanced features that deliver efficient, scalable bandwidth with low latency and short tail latency, capabilities that were previously exclusive to InfiniBand networks. These features include adaptive routing based on NVIDIA DDP (Direct Data Placement) technology, congestion control, and enhanced visibility and performance isolation for AI fabrics, all of which are key requirements for multi-tenant generative AI clouds and large enterprise environments.
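As a rough intuition for why load-aware path selection reduces flow collisions, the toy model below compares two ways of placing equal-sized flows on equal-capacity links. It is a simplified sketch under stated assumptions and is not NVIDIA's routing or congestion control algorithm. The function names, the 64-flow and 64-link sizes, and the one-flow-per-link capacity are all hypothetical.

```python
import random


def static_hash_placement(num_flows: int, num_links: int, seed: int = 0) -> list[int]:
    """Pin each flow to a pseudo-randomly chosen link, ignoring current load.

    This stands in for static, hash-based path selection (toy assumption).
    """
    rng = random.Random(seed)
    loads = [0] * num_links
    for _ in range(num_flows):
        loads[rng.randrange(num_links)] += 1
    return loads


def least_loaded_placement(num_flows: int, num_links: int) -> list[int]:
    """Place each flow on the currently least-loaded link (toy load-aware policy)."""
    loads = [0] * num_links
    for _ in range(num_flows):
        loads[loads.index(min(loads))] += 1
    return loads


def delivered_fraction(loads: list[int], flows_per_link: int = 1) -> float:
    """Fraction of offered flows that fit within each link's capacity."""
    offered = sum(loads)
    delivered = sum(min(load, flows_per_link) for load in loads)
    return delivered / offered


# Hypothetical sizes chosen only for illustration.
NUM_FLOWS = 64
NUM_LINKS = 64

static_loads = static_hash_placement(NUM_FLOWS, NUM_LINKS)
adaptive_loads = least_loaded_placement(NUM_FLOWS, NUM_LINKS)

print(f"static placement delivers     {delivered_fraction(static_loads):.0%}")
print(f"load-aware placement delivers {delivered_fraction(adaptive_loads):.0%}")
```

In this model the static policy leaves some links idle while others carry several colliding flows, whereas the load-aware policy spreads the flows evenly. Real fabrics are far more complex, so the output should be read as an illustration of the collision effect only.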
