Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance

Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).

As LLMs continue to grow in size – with the latest community models now featuring hundreds of billions of parameters – they deliver more accurate responses and also support even larger context windows to allow users to ask longer, information-rich queries. For example, the Llama 3.1 and 3.2 family of LLMs supports up to 128K token context windows, or roughly the length of a novel. These capabilities make LLMs more useful, but they also require more delivered parallel compute performance for good interactivity.

Until now, AI has scaled with model pre-training. Recent advances will also scale with post-training synthetic data generation and inference-time reasoning. Inference performance and scaling is now critically important.

In this post, we show how the NVIDIA GH200 NVL32 system, powered by 32 NVIDIA GH200 Grace Hopper Superchips connected using the NVLink Switch system, and with TensorRT-LLM improvements, scales to deliver outstanding TTFT for long-context inference using the latest Llama 3.1 70B and 405B models.

	Llama 3.1 70B Time to First token (milliseconds) (Lower is better)
Input Sequence Length	GH200 NVL32
4,096	64
32,768	472
122,880	2,197

	Llama 3.1 405B Time to First token (milliseconds) (Lower is better)
Input Sequence Length	GH200 NVL32
4,096	208
32,768	1,627
122,880	7,508

Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance

Time-to-first-token matters for real-time use cases

NVIDIA GH200 NVL32 supercharges TTFT for long context inference

Llama 3.1 70B

Llama 3.1 405B

Inference continues to be a hotbed of invention

Next up: accelerating agentic workflows

NVIDIA Blackwell GB200 NVL72 powers a new era of computing

Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy

AI Explores RNA Dark Matter and Finds 70,000 New Viruses

Nvidia’s Newest Foundation Model Can Actually Spell ‘Strawberry’

Beyond the Buzzwords, Building a Resilient Future

Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs

Related articles

Bookshop Day: our 2024 recommendations

Kioxia Demonstrates Optical Interface SSDs for Data Centers

End of the Road: An AnandTech Farewell

Take Aim With This Glorious Ultraman Trading Card Art