Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B

Share post:

Today, NVIDIA released a unique language model that delivers an unmatched accuracy-efficiency performance. Llama 3.1-Nemotron-51B, derived from Meta’s Llama-3.1-70B, uses a novel neural architecture search (NAS) approach that results in a highly accurate and efficient model.

The model fits on a single NVIDIA H100 GPU at high workloads, making it much more accessible and affordable. The excellent accuracy-efficiency sweet spot exhibited by the new model stems from changes to the model’s architecture that lead to a significantly lower memory footprint, reduced memory bandwidth, and reduced FLOPs while maintaining excellent accuracy. We demonstrate that this approach can be generalized by creating another smaller and faster variant from the reference model.

In July 2024, Meta released Llama-3.1-70B, a leading state-of-the-art large language model (LLM). Today we announce Llama 3.1-Nemotron-51B-Instruct, developed using NAS and knowledge distillation derived from the reference model, Llama-3.1-70B. 

Superior throughput and workload efficiency

The Nemotron model yields 2.2x faster inference compared to the reference model while maintaining nearly the same accuracy. The model opens a new set of opportunities with a reduced memory footprint, which enables running 4x larger workloads on a single GPU during inference.

  Accuracy Efficiency
  MT Bench MMLU Text generation (128/1024) Summarization/ RAG (2048/128)
Llama-3.1- Nemotron-51B- Instruct 8.99 80.2% 6472 653
Llama 3.1-70B- Instruct 8.93 81.66% 2975 339
Llama 3.1-70B- Instruct (single GPU) 1274 301
Llama 3-70B 8.94 80.17% 2975 339
Table 1. Overview of the Llama-3.1-Nemotron-51B-Instruct accuracy and efficiency.

Note: Speed is reported in tokens per second per GPU, Measured on machines equipped with 8 X NVIDIA H100 SXM GPUs, with FP8 quantization using TRT-LLM as the runtime engine. For each model with the optimal number of GPUs through tensor parallelism (unless otherwise stated). The numbers in the brackets show the (input/output sequence lengths).

We discuss the detailed performance metrics later in this post.

Optimized accuracy per dollar

Foundation models display incredible quality in solving complex tasks: reasoning, summarization, and more. However, a major challenge in the adoption of ‌top models is their inference cost.

As the field of generative AI evolves, the balance between accuracy and efficiency (directly impacting cost) will become the decisive factor in model selection. Moreover, the capability to run a model on a single GPU significantly streamlines its deployment, opening opportunities for new applications to run anywhere, from edge systems to data centers to the cloud, as well as facilitating serving multiple models via Kubernetes and NIM blueprints. 

Consequently, we engineered Llama 3.1-Nemotron-51B-Instruct to achieve this optimal tradeoff (Figure 1). Throughput is inversely proportional to price, so the best tradeoff is obtained by models on the efficient frontier displayed in the chart. Figure 1 shows that the model pushes beyond the current efficient frontier, making it the model that provides the best accuracy per dollar.

Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B
Figure 1. Accuracy vs. Throughput performance of Llama-3.1-Nemotron-51B compared to frontier models. Throughput was measured through NIM with concurrency 25 (serving throughput).

The model quality is defined as the weighted average of MT-Bench and MMLU (10*MT-Bench + MMLU)/2, plotted compared to model throughput per a single NVIDIA H100 80GB GPU. Gray dots represent state-of-the-art models, while the dashed line represents the ‘efficient frontier’.

Simplifying inference with NVIDIA NIM

The Nemotron model is optimized with TensorRT-LLM engines for higher inference performance and packaged as an NVIDIA NIM microservice to streamline and accelerate the deployment of generative AI models across NVIDIA accelerated infrastructure anywhere, including cloud, data center, and workstations.

NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand.

Building the model with NAS

Inference and hardware-aware methods for designing neural architectures have been successfully used in many domains. However, LLMs are still constructed as repeated identical blocks, with little regard for inference cost overheads incurred by this simplification. To tackle these challenges, we developed efficient NAS technology and training methods that can be used to create non-standard transformer models designed for efficient inference on specific GPUs.

Our technology can select neural architectures that optimize various constraints. The range includes enormous design spaces that include a zoo of non-standard transformer models using alternative attention and FFN blocks of varying efficiency degrees, up to a complete block elimination in the extreme case.

We then use our block-distillation (Figure 2) framework to train all these block variants for all layers of a (large) parent LLM in parallel. In a basic version of block-distillation, training data is passed through the reference model  (also known as a teacher).

For each block, its input is taken from the teacher and injected into the matching block of the student. The outputs of the teacher and student for the block are compared and the student block is trained so that the student block mimics the functionality of the teacher block. A more advanced scenario where a single student block mimics multiple teacher blocks is depicted on the right side in Figure 2.

The image displays two diagrams illustrating the block-distillation process. In each diagram, a student block (yellow) is trained to mimic either one or multiple blocks of the teacher model (blue). The training aims to minimize the difference between the input to the teacher block (red curved line) and the output of the teacher block (black curved line). This minimization is achieved using a loss function (depicted in green). The left diagram shows one teacher block being mimicked by the student block, while the right diagram demonstrates the student block mimicking the operation of two consecutive teacher blocks.
Figure 2. Block distillation where blue reference model blocks are multiple variants for the yellow student models that mimic the block-wise teacher functionality

Next, we use our Puzzle algorithm to efficiently score each alternative replacement puzzle piece and search our enormous design space for the most accurate models, while adhering to a set of inference constraints, such as memory size and required throughput.

Finally, by using knowledge distillation (KD) loss for both block scoring and training, we demonstrate the potential to narrow the accuracy gap between our model and the reference model using a much more efficient architecture with a tiny fraction of the reference model training costs. Using our methods on Llama-3.1-70B as the reference model, we built ​​Llama-3.1-Nemotron-51B-Instruct, a 51B model that breaks the efficient frontier of LLMs on a single NVIDIA H100 GPU (Figure 1). 

The ​​Llama-3.1-Nemotron-51B-Instruct architecture is unique in its irregular block structure with many layers in which the attention and FFN are reduced or pruned, resulting in better utilization of H100 and highlighting the importance of optimizing LLMs for inference. Figure 3 schematically depicts the irregular structure of the resulting architecture and highlights the resulting compute saving, which amounts to the green area in the Figure.

The image shows runtime of puzzle chosen blocks that are depicted as layers. They are color coded, blue blocks attention layers, red blocks for FFN layers, and green blocks represent overall runtime savings. The image shows 80 layers of the reference models.
Figure 3. Runtime of Puzzle chosen blocks (layers) for attention layers (blue) and FFN layers (red) across the 80 layers of the reference model. Green areas correspond to overall runtime savings.

Our innovative techniques enable us to develop models that redefine the efficient frontier of LLMs. Crucially, we can cost-effectively design multiple models from a single reference model, each optimized for specific hardware and inference scenarios. This capability empowers us to maintain best-in-class performance for LLM inference across our current and future hardware platforms.

Detailed results

Here are the model accuracy and performance metrics for our model.

Model accuracy

Table 2 lists all the benchmarks that we evaluated, comparing our model and the reference model Llama-3.1-70B. The Accuracy preserved column is the ratio between our model’s score and that of the teacher. 

Benchmark Llama-3.1 70B-instruct Llama-3.1-Nemotron-51B- Instruct Accuracy preserved
winogrande 85.08% 84.53% 99.35%
arc_challenge 70.39% 69.20% 98.30%
MMLU 81.66% 80.20% 98.21%
hellaswag 86.44% 85.58% 99.01%
gsm8k 92.04% 91.43% 99.34%
truthfulqa 59.86% 58.63% 97.94%
xlsum_english 33.86% 31.61% 93.36%
MMLU Chat 81.76% 80.58% 98.55%
gsm8k Chat 81.58% 81.88% 100.37%
Instruct HumanEval (n=20) 75.85% 73.84% 97.35%
MT Bench 8.93 8.99 100.67%
Table 2. Accuracy comparison of the Nemotron model to the Llama-3.1-70B-Instruct model across several industry benchmarks

Performance

Table 3 shows the number of tokens per second per GPU (NVIDIA H100 80-GB GPU). You can see that for a range of relevant scenarios, short and long inputs as well as outputs, our model doubles the throughput of the teacher model, making it cost-effective across multiple use cases.

TPX describes the number of GPUs on which the process runs in parallel. We also list the performance of Llama 3.1-70B on a single GPU to demonstrate the value of our model in such a setting.

Scenario Input/Output Sequence Length Llama-3.1- Nemotron-Instruct Llama-3.1-70B-Instruct Ratio Llama (TP1)
Chatbot 128/128 5478 (TP1) 2645 (TP1) 2.07 2645
Text generation 128/1024 6472 (TP1) 2975 (TP4) 2.17 1274
Long text generation 128/2048 4910 (TP2) 2786 (TP4) 1.76 646
System 2 reasoning 128/4096 3855 (TP2) 1828 (TP4) 2.11 313
Summarization/ RAG 2048/128 653 (TP1) 339 (TP4) 1.92 300
Stress test 1 2048/2048 2622 (TP2) 1336 (TP4) 1.96 319
Table 3. Throughput comparison of the number of tokens generated by the models for popular use cases. All numbers are in tokens per second per GPU.

The main factor in determining the cost of running a model is throughput, the total number of tokens that the system can generate in one second. However, in some scenarios (for example, chatbots), the rate at which a single end user receives the response from the model is important for the user experience. This is quantified by the tokens per second per user, termed the user-side throughput.

Figure 4 shows this user-side throughput plotted against the throughput at different batch sizes. As seen in all batch sizes, our model is superior to Llama-3.1-70B. 

The chart shows that the Llama 3.1-70B curve is lower compared to Llama-3.1-Nemotron-51B-Instruct.
Figure 4. Server throughput vs. user-side throughput, plotted at different batch sizes for the Nemotron model and for Llama-3.1-70B

Tailoring LLMs for diverse needs 

The NAS approach offers you flexibility in selecting the optimal balance between accuracy and efficiency. To demonstrate this versatility, we created another variant from the same reference model, this time prioritizing speed and cost. Llama-3.1-Nemotron-40B-Instruct was developed using the same methodology but with a modified speed requirement during the puzzle phase. 

This model achieves a 3.2x speed increase compared to the parent model, with a moderate decrease in accuracy. Table 4 shows competitive performance metrics.

Accuracy Speed
MT bench MMLU Text generation (128/1024) Summarization/ RAG (2048/128)
Llama-3.1- Nemotron-40B-Instruct 8.69 77.10% 9568 862
Llama-3.1- Nemotron-51B-Instruct 8.99 80.20% 6472 653
Llama 3.1-70B-Instruct 8.93 81.72% 2975 339
Table 4. Overview of the Llama-3.1-Nemotron-40B-Instruct accuracy and efficiency

Summary

Llama 3.1-Nemotron-51B-Instruct provides a new set of opportunities for users and companies that want to use highly accurate foundation models, but do so in a cost-controlled manner. By providing the best tradeoff between accuracy and efficiency, we believe the model is an attractive option for builders. Moreover, these results demonstrate the effectiveness of the NAS approach and intend to extend the method to other models. 

Related articles

Microsoft Cuts Off Azure OpenAI Access for Chinese Developers

Microsoft is preparing to stop providing its Azure OpenAI Service for individual developers in mainland China. Starting October...

Producing Cinematic Content at Scale with a Generative AI-Enabled OpenUSD Pipeline

Producing commercials is resource-intensive, requiring physical locations and various props and setups to display products in different settings...