Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time that it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
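TTFT can be measured directly against any server that streams tokens: start a timer when the request is sent and stop it when the first streamed chunk arrives. The sketch below illustrates this with a hypothetical streaming HTTP endpoint; the URL and payload fields are placeholders rather than anything specific to this post.

```python
# A minimal sketch of how TTFT is typically measured: start a timer when the
# request is sent and stop it when the first streamed token arrives.
# The endpoint URL and payload below are placeholders; any server that streams
# tokens (for example, an OpenAI-compatible API) works the same way.
import time
import requests

def measure_ttft(url: str, prompt: str) -> float:
    payload = {"prompt": prompt, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty chunk carries the first generated token
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token was produced")

# Example (hypothetical endpoint):
# print(f"TTFT: {measure_ttft('http://localhost:8000/v1/completions', 'Summarize ...'):.3f} s")
```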
As LLMs continue to grow in size – with the latest community models now featuring hundreds of billions of parameters – they deliver more accurate responses and support even larger context windows, allowing users to ask longer, information-rich queries. For example, the Llama 3.1 and 3.2 families of LLMs support up to 128K-token context windows, or roughly the length of a novel. These capabilities make LLMs more useful, but they also require more delivered parallel compute performance for good interactivity.
Until now, AI has scaled primarily through model pre-training. Recent advances also scale through post-training techniques such as synthetic data generation, and through inference-time reasoning. As a result, inference performance and scaling are now critically important.
In this post, we show how the NVIDIA GH200 NVL32 system, powered by 32 NVIDIA GH200 Grace Hopper Superchips connected using the NVLink Switch system, and with TensorRT-LLM improvements, scales to deliver outstanding TTFT for long-context inference using the latest Llama 3.1 70B and 405B models.
Time-to-first-token matters for real-time use cases
Applications such as AI speech bots, digital assistants, AI NPCs in games, and more aim to simulate natural, human-like conversational capabilities. For these use cases, a TTFT in the realm of a few hundred milliseconds is crucial.
To understand the impact of TTFT on the user experience, consider the following animations. The first represents a TTFT of about half a second, while the second represents a TTFT of about five seconds.
Fast TTFT is particularly impactful in services where up-to-date knowledge is important, such as the agentic workflows that have recently risen in popularity. To build useful agents, Retrieval-Augmented Generation (RAG) – which enhances LLM prompts with relevant data – is needed for accurate actions and responses. This means contexts can become very long, spanning tens or hundreds of thousands of tokens. A fast TTFT, even at such long contexts, makes these services feel more interactive.
Below, we show how NVIDIA GH200 NVL32 achieves the fastest published TTFT for the Llama 3.1 models, even at very long contexts.
NVIDIA GH200 NVL32 supercharges TTFT for long context inference
To generate the first new token in response to an inference request, all of the input tokens must first be processed by the LLM. This phase of inference, known as prefill, often involves a large number of tokens and thus benefits from greater aggregate compute performance. It can be accelerated by splitting the calculations across multiple GPUs using parallelism techniques, such as tensor parallelism.
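To illustrate what tensor parallelism does during prefill, the small NumPy sketch below splits one MLP block Megatron-style: the first weight matrix by columns, the second by rows, with the partial outputs summed at the end – which is the reduction an AllReduce performs across GPUs. The shapes are toy values for illustration, not real Llama dimensions.

```python
# A small NumPy sketch of tensor parallelism for one MLP block (Megatron-style):
# the first weight matrix is split by columns, the second by rows, and the
# partial outputs are summed - the sum is what an AllReduce performs across GPUs.
import numpy as np

rng = np.random.default_rng(0)
tokens, hidden, ffn, tp = 8, 16, 64, 4   # toy sizes; tp = tensor-parallel ranks

X = rng.standard_normal((tokens, hidden))
A = rng.standard_normal((hidden, ffn))    # up-projection weights
B = rng.standard_normal((ffn, hidden))    # down-projection weights

# Reference: single-GPU computation (ReLU stands in for the activation).
reference = np.maximum(X @ A, 0) @ B

# Tensor-parallel: each rank holds a column slice of A and a row slice of B.
A_shards = np.split(A, tp, axis=1)
B_shards = np.split(B, tp, axis=0)
partials = [np.maximum(X @ A_i, 0) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# The AllReduce step: summing the partial outputs reproduces the full result.
allreduced = sum(partials)
assert np.allclose(reference, allreduced)
```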
When computations are split across many GPUs using tensor parallelism, every GPU involved must exchange data with every other GPU in an AllReduce synchronization that happens twice per model layer. As the number of GPUs involved in the calculation increases, the total amount of synchronization traffic grows. Llama 3.1 405B incorporates 126 layers, yielding 252 AllReduce synchronizations per inference step. As a result, running Llama 3.1 405B across 32 GPUs with an input sequence of 122,880 tokens generates 114 TB of aggregate interconnect traffic. A high-bandwidth, low-latency all-to-all GPU-to-GPU fabric is needed to minimize the time spent in these synchronizations and maximize the time GPUs spend on compute.
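As a rough sense of the scale involved, the back-of-envelope sketch below counts the AllReduce synchronizations and sizes the activation tensor reduced each time, using the public Llama 3.1 405B dimensions and an assumed FP16 activation width. The exact over-the-wire traffic additionally depends on the collective algorithm and the tensor-parallel width, so this only sizes the tensors being reduced.

```python
# Back-of-envelope accounting for the synchronization described above.
# Model dimensions follow the public Llama 3.1 405B architecture; the FP16
# activation width is an assumption, and the over-the-wire traffic of each
# AllReduce additionally depends on the collective algorithm and TP width.
num_layers = 126          # Llama 3.1 405B transformer layers
hidden_size = 16_384      # Llama 3.1 405B model dimension
seq_len = 122_880         # prefill input tokens
bytes_per_activation = 2  # FP16/BF16 activations (assumed)

allreduces_per_step = 2 * num_layers                         # two per layer
tensor_bytes = seq_len * hidden_size * bytes_per_activation  # per AllReduce

print(f"AllReduce synchronizations per prefill step: {allreduces_per_step}")
print(f"Activation tensor reduced each time: {tensor_bytes / 1e9:.1f} GB")
print(f"Total activation data reduced per step: "
      f"{allreduces_per_step * tensor_bytes / 1e12:.2f} TB")
# Multiplying by the per-AllReduce communication factor for 32 GPUs is what
# pushes the aggregate interconnect traffic to the scale quoted above.
```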
GH200 NVL32 is a rack-scale solution that connects 32 NVIDIA GH200 Grace Hopper Superchips – each composed of an NVIDIA Grace CPU and an NVIDIA Hopper GPU connected via NVLink-C2C – using the NVLink Switch System. This allows each Hopper GPU to communicate with any other GPU in the NVLink domain at the full 900 GB/s, for 28.8 TB/s of aggregate bandwidth. With the NVLink Switch System, the 32 GH200 Superchips effectively form “one mighty GPU” with up to 127 petaFLOPs of peak FP8 AI compute. This helps dramatically shorten TTFT, particularly on the most demanding models with long contexts.
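The aggregate figures follow directly from the per-GPU numbers, as the quick check below shows. The per-GPU peak FP8 value is assumed from public Hopper specifications (with sparsity); it is not stated in this post.

```python
# Quick arithmetic check of the aggregate figures quoted above.
num_gpus = 32
nvlink_bw_gbps = 900            # GB/s of NVLink bandwidth per Hopper GPU
fp8_pflops_per_gpu = 3.958      # peak FP8 PFLOPS per GPU, with sparsity (assumed)

print(f"Aggregate NVLink bandwidth: {num_gpus * nvlink_bw_gbps / 1000:.1f} TB/s")  # ~28.8 TB/s
print(f"Aggregate peak FP8 compute: {num_gpus * fp8_pflops_per_gpu:.0f} PFLOPS")   # ~127 PFLOPS
```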
Scaling from eight NVIDIA H200 Tensor Core GPUs to 32 GH200 Grace Hopper Superchips accelerates TTFT for a 122,880-token Llama 3.1 405B query by 3x, enabling a real-time experience. And even for Llama 3.1 70B, the same-length query sees a 2.6x TTFT speedup.
In the following sections, we show how GH200 NVL32 makes responsive, long context Llama 3.1 70B and 405B inference possible.
Llama 3.1 70B
A single GH200 NVL32 system achieves a TTFT of just 472 milliseconds when running Llama 3.1 70B with an input sequence length of 32,768 tokens. In practical terms, this means Llama 3.1 70B can begin outputting a summary of a 90-page document, or coding suggestions across thousands of lines of code, in less than half a second.
Llama 3.1 70B Time to First Token (lower is better)

| Input Sequence Length (tokens) | GH200 NVL32 TTFT (ms) |
| --- | --- |
| 4,096 | 64 |
| 32,768 | 472 |
| 122,880 | 2,197 |
Data measured between 9/6/2024 and 9/10/2024 using an internal TensorRT-LLM development branch. Batch = 1.
And, for an input sequence length of 122,880 tokens – approximately 15K lines of code or a 330-page book – GH200 NVL32 can achieve a TTFT of just 2.2 seconds.
Llama 3.1 405B
Llama 3.1 405B requires substantially more compute to generate the first token of a response, as the model incorporates nearly 6X the parameter count of Llama 3.1 70B.
Llama 3.1 405B Time to First Token (lower is better)

| Input Sequence Length (tokens) | GH200 NVL32 TTFT (ms) |
| --- | --- |
| 4,096 | 208 |
| 32,768 | 1,627 |
| 122,880 | 7,508 |
Data measured between 9/6/2024 and 9/10/2024 using an internal TensorRT-LLM development branch. Batch = 1.
Running Llama 3.1 405B, GH200 NVL32 provides a TTFT of about 1.6 seconds with a 32,768-token input. And, with a 122,880-token input – roughly the size of a small codebase – GH200 NVL32 can begin responding in just 7.5 seconds.
Inference continues to be a hotbed of invention
The pace of inference innovation across serving techniques, runtime optimizations, kernels and more has been extraordinary. Advancements like in-flight batching, speculative decoding, FlashAttention, key-value caching, and more have been developed by both industry and academia. Collectively, these innovations are enabling more capable models and systems to be deployed efficiently and more cost-effectively in production, making powerful AI capabilities more accessible to the entire NVIDIA ecosystem.
To innovate quickly, researchers need a rich developer ecosystem and a productive tool stack. And, for innovations to have the greatest reach, a large platform installed base is required. The NVIDIA accelerated computing platform has more than 5 million developers, with an installed base of several hundred million GPUs across CSPs, on-prem, personal computers, and edge devices – all compatible with the CUDA programming model. Deep engagement with developers, computing providers, and customers enables and accelerates AI innovation on the NVIDIA platform.
Next up: accelerating agentic workflows
Agentic workflows perform tree search, self-reflection, and iterative inference to reason through and answer complex queries. This means the number of inferences per prompt will grow by orders of magnitude. With each successive inference, the aggregate response must be processed by the next agent as new context, so fast TTFT becomes even more important as workflows scale.
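The toy sketch below shows why: each agent step feeds the accumulated context plus the previous output back into the model, so every call pays a full prefill over a growing input. The `call_llm` function is a placeholder for any model backend, and the word counts are purely illustrative.

```python
# A minimal sketch of why TTFT compounds in agentic workflows: each step feeds
# the accumulated context plus the previous output back into the model, so the
# prefill grows with every iteration.
def call_llm(context: str) -> str:
    """Placeholder for a real model call (e.g., a TensorRT-LLM or API backend)."""
    return f"<reasoning step over {len(context.split())} words>"

def run_agent(task: str, steps: int = 4) -> str:
    context = task
    for step in range(steps):
        response = call_llm(context)          # each call pays a full prefill (TTFT)
        context = context + "\n" + response   # aggregate response becomes new context
        print(f"step {step}: context is now {len(context.split())} words")
    return context

run_agent("Plan a multi-stage data migration with rollback checkpoints.")
```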
Fast token generation speed is also important for agentic workflows. In a future chapter, we will provide an update on accelerating token generation speed by scaling to many more GPUs on the NVIDIA platform with NVLink and the NVLink Switch system.
NVIDIA Blackwell GB200 NVL72 powers a new era of computing
Looking ahead, as model sizes continue to grow rapidly, models support even longer context lengths, and agentic workflows become more popular, the amount of delivered compute performance required for fast inference will continue to rise.
The GB200 NVL72, based on the NVIDIA Blackwell platform, delivers the next giant leap for generative AI and accelerated computing. With second-generation Transformer Engine and fifth-generation Tensor Cores, Blackwell delivers up to 20 PFLOPS of FP4 AI compute – up to 5x the AI compute of NVIDIA Hopper. And, fifth-generation NVLink provides 1,800 GB/s of GPU-to-GPU bandwidth – twice that provided by Hopper – and expands the NVLink domain to 72 GPUs with the GB200 NVL72 rack-scale system, enabled by the latest NVLink Switch chip.
NVIDIA continues to innovate at every layer of the technology stack to increase performance, reduce total cost of ownership, and enable the next generation of AI.
This blog is part of a series – view Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch.