Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch

The continued growth of LLMs capability, fueled by increasing parameter counts and support for longer contexts, has led to their usage in a wide variety of applications, each with diverse deployment requirements. For example, a chatbot supports a small number of users at very low latencies for good interactivity. Meanwhile, synthetic data generation requires high throughput to process many items at once. Delivering optimal inference performance across a wide range of use cases with one platform requires optimization across the entire technology stack.

Cutting-edge LLMs, like Llama 3.1 405B, require multiple GPUs working together for peak performance. To effectively use multiple GPUs for processing inference requests, an inference software stack must provide developers with optimized implementations of key parallelism techniques, including tensor, pipeline, and expert parallelism. These parallelism techniques require that GPUs be able to transfer data quickly and efficiently, necessitating a robust GPU-to-GPU interconnect fabric for maximum performance.

In this post, we explain two of these parallelism techniques and show, on an NVIDIA HGX H200 system with NVLink and NVSwitch, how the right parallelism increases Llama 3.1 405B performance by 1.5x in throughput-sensitive scenarios. We also show how use of pipeline parallelism enabled a 1.2x speedup in the MLPerf Inference v4.1 Llama 2 70B benchmark on HGX H100 compared to our results published in August. These improvements are possible due to recent software improvements in TensorRT-LLM with NVSwitch.

Llama 3.1 405B Output Tokens/second (higher is better)		Parallelism
		Tensor	Pipeline
Scenario	minimum latency	56	10
Scenario	maximum throughput	506	764

MLPerf Inference Output Tokens/second (higher is better)		Parallelism
		Tensor Parallelism	Pipeline Parallelism
Scenario	Llama 2 70B	24,525	29,741

Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch

Choosing parallelism for deployment

Pipeline parallelism delivers 1.2x boost on MLPerf on H100

Tensor and pipeline parallelism explained

NVLink Switch helps maximize high-throughput performance

The NVIDIA platform is advancing at the speed of light

Evaluating Medical RAG with NVIDIA AI Endpoints and Ragas

Perplexity lets you search your internal enterprise files and the web

How to Fine-Tune a FLUX Model in under an hour with AI Toolkit and a DigitalOcean H100 GPU

New Spear-Phishing Campaign Deploys ‘More_eggs’ Backdoor

Physical Edition Release Coming Soon!

Related articles

Surgent Studios puts staff on hiatus while searching for a publisher

218 Layers With Superior Scaling

Microsoft Cuts Off Azure OpenAI Access for Chinese Developers

Kioxia Demonstrates Optical Interface SSDs for Data Centers