Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B

Today, NVIDIA released a unique language model that delivers an unmatched accuracy-efficiency performance. Llama 3.1-Nemotron-51B, derived from Meta’s Llama-3.1-70B, uses a novel neural architecture search (NAS) approach that results in a highly accurate and efficient model.

The model fits on a single NVIDIA H100 GPU at high workloads, making it much more accessible and affordable. The excellent accuracy-efficiency sweet spot exhibited by the new model stems from changes to the model’s architecture that lead to a significantly lower memory footprint, reduced memory bandwidth, and reduced FLOPs while maintaining excellent accuracy. We demonstrate that this approach can be generalized by creating another smaller and faster variant from the reference model.

In July 2024, Meta released Llama-3.1-70B, a leading state-of-the-art large language model (LLM). Today we announce Llama 3.1-Nemotron-51B-Instruct, developed using NAS and knowledge distillation derived from the reference model, Llama-3.1-70B.

	Accuracy		Efficiency
	MT Bench	MMLU	Text generation (128/1024)	Summarization/ RAG (2048/128)
Llama-3.1- Nemotron-51B- Instruct	8.99	80.2%	6472	653
Llama 3.1-70B- Instruct	8.93	81.66%	2975	339
Llama 3.1-70B- Instruct (single GPU)	—	—	1274	301
Llama 3-70B	8.94	80.17%	2975	339

Benchmark	Llama-3.1 70B-instruct	Llama-3.1-Nemotron-51B- Instruct	Accuracy preserved
winogrande	85.08%	84.53%	99.35%
arc_challenge	70.39%	69.20%	98.30%
MMLU	81.66%	80.20%	98.21%
hellaswag	86.44%	85.58%	99.01%
gsm8k	92.04%	91.43%	99.34%
truthfulqa	59.86%	58.63%	97.94%
xlsum_english	33.86%	31.61%	93.36%
MMLU Chat	81.76%	80.58%	98.55%
gsm8k Chat	81.58%	81.88%	100.37%
Instruct HumanEval (n=20)	75.85%	73.84%	97.35%
MT Bench	8.93	8.99	100.67%

Scenario	Input/Output Sequence Length	Llama-3.1- Nemotron-Instruct	Llama-3.1-70B-Instruct	Ratio	Llama (TP1)
Chatbot	128/128	5478 (TP1)	2645 (TP1)	2.07	2645
Text generation	128/1024	6472 (TP1)	2975 (TP4)	2.17	1274
Long text generation	128/2048	4910 (TP2)	2786 (TP4)	1.76	646
System 2 reasoning	128/4096	3855 (TP2)	1828 (TP4)	2.11	313
Summarization/ RAG	2048/128	653 (TP1)	339 (TP4)	1.92	300
Stress test 1	2048/2048	2622 (TP2)	1336 (TP4)	1.96	319

	Accuracy		Speed
	MT bench	MMLU	Text generation (128/1024)	Summarization/ RAG (2048/128)
Llama-3.1- Nemotron-40B-Instruct	8.69	77.10%	9568	862
Llama-3.1- Nemotron-51B-Instruct	8.99	80.20%	6472	653
Llama 3.1-70B-Instruct	8.93	81.72%	2975	339

Superior throughput and workload efficiency

Optimized accuracy per dollar

Simplifying inference with NVIDIA NIM

Building the model with NAS

Detailed results

Model accuracy

Performance

Tailoring LLMs for diverse needs

Summary

Leave a comment Cancel reply

Graphi Max

Navigation

Categories