Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy

This post was originally published August 21, 2024 but has been revised with current data.

Recently, NVIDIA and Mistral AI unveiled Mistral NeMo 12B, a state-of-the-art large language model (LLM) that consistently outperforms similarly sized models on a wide range of benchmarks.

We announced Mistral-NeMo-Minitron 8B, one of the most advanced open-access models in its size class. This model delivers leading accuracy across nine popular benchmarks. The Mistral-NeMo-Minitron 8B base model was obtained by width-pruning the Mistral NeMo 12B base model, followed by a light retraining process using knowledge distillation. NVIDIA originally proposed this recipe in the paper Compact Language Models via Pruning and Knowledge Distillation, and it has since been validated with the NVIDIA Minitron 8B and 4B models and the Llama-3.1-Minitron 4B model.
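To make the two steps of this recipe concrete, here is a minimal, dependency-free Python sketch of width pruning followed by a distillation loss. It is an illustration under stated assumptions, not the pipeline used to build Mistral-NeMo-Minitron 8B: the importance score (mean absolute activation over a small calibration batch), the forward KL divergence on logits, and every function name below are assumptions chosen for the example.

```python
import math


def rank_neurons(activations):
    """Score each hidden neuron by its mean absolute activation.

    activations: list of calibration samples, each a list with one value
    per hidden neuron. (Assumed importance metric, for illustration only.)
    """
    num_neurons = len(activations[0])
    return [
        sum(abs(sample[j]) for sample in activations) / len(activations)
        for j in range(num_neurons)
    ]


def width_prune(weights_in, weights_out, scores, keep):
    """Keep the `keep` highest-scoring hidden neurons of a two-layer MLP.

    weights_in:  hidden x d_in  (one row per hidden neuron)
    weights_out: d_out x hidden (one column per hidden neuron)
    Returns the pruned matrices and the sorted indices that were kept.
    """
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    pruned_in = [weights_in[j] for j in kept]
    pruned_out = [[row[j] for j in kept] for row in weights_out]
    return pruned_in, pruned_out, kept


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0
    )


# Toy example: a hidden layer with 4 neurons, pruned down to 2.
calibration = [[0.1, -2.0, 0.0, 1.5], [0.2, 1.8, -0.1, -1.2]]
w_in = [[1, 0], [0, 1], [1, 1], [2, 0]]
w_out = [[1, 2, 3, 4]]

scores = rank_neurons(calibration)
pruned_in, pruned_out, kept = width_prune(w_in, w_out, scores, keep=2)
# kept == [1, 3]: neurons 1 and 3 have the largest mean |activation|

# The pruned student is then retrained to match the teacher's output
# distribution; the loss is zero only when the two distributions agree.
loss = distillation_loss([2.0, 0.5, -1.0], [1.0, 1.0, -0.5])
```

Width pruning shrinks dimensions inside each layer (here, hidden neurons) while keeping the number of layers unchanged. The distillation term then pulls the smaller student back toward the teacher's predictions during the light retraining phase.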

*Figure 1. Model pruning and distillation for Mistral-NeMo-Minitron-8B-Base and -Instruct models*

As shown in Figure 1, the Nemotron-4-340B-Instruct and Nemotron-4-340B-Reward models were used to generate synthetic data for alignment.

	MMLU 5-shot	GSM8K 0-shot	GPQA 0-shot	HumanEval 0-shot	MBPP 0-shot	IFEval	MTBench (GPT4-Turbo)	BFCL v2 Live
Mistral-NeMo-Minitron 8B Instruct	70.4	87.1	31.5	71.3	72.5	84.4	7.86	67.6
Llama-3.1-8B-Instruct	69.4	83.9	30.4	72.6	72.8	79.7	7.78	44.3
Mistral-NeMo-12B-Instruct	68.4	79.8	28.6	68.3	66.7	64.7	8.10	47.9

Table 1. Accuracy of the Mistral-NeMo-Minitron-8B-Instruct model compared to Llama-3.1-8B-Instruct and the teacher Mistral-NeMo-12B models. Bold numbers represent the best amongst the 8B model class.

	Training tokens	Winogrande 5-shot	ARC Challenge 25-shot	MMLU 5-shot	HellaSwag 10-shot	GSM8K 5-shot	TruthfulQA 0-shot	XLSum en (20%) 3-shot	MBPP 0-shot	HumanEval 0-shot
Llama-3.1-8B	15T	77.27	57.94	65.28	81.80	48.60	45.06	30.05	42.27	24.76
Gemma-7B	6T	78	61	64	82	50	45	17	39	32
Mistral-NeMo-Minitron-8B	380B	80.35	64.42	69.51	83.03	58.45	47.56	31.94	43.77	36.22
Mistral-NeMo-12B	N/A	82.24	65.10	68.99	85.16	56.41	49.79	33.43	42.63	23.78

Table 2. Accuracy of the Mistral-NeMo-Minitron-8B-Base model compared to the Llama-3.1-8B and Gemma-7B base models and the teacher Mistral-NeMo-12B model. Bold numbers represent the best amongst the 8B model class.
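The comparison in Table 2 can be checked with a few lines of Python. The sketch below copies the table values as listed, finds the leading model per benchmark, and computes the ratio of listed training tokens. Treating Gemma-7B as part of the 8B class and leaving out the 12B teacher is an assumption made for this example, and the variable and function names are illustrative only.

```python
BENCHMARKS = [
    "Winogrande", "ARC Challenge", "MMLU", "HellaSwag", "GSM8K",
    "TruthfulQA", "XLSum en (20%)", "MBPP", "HumanEval",
]

# Scores copied from Table 2, in the column order above.
SCORES = {
    "Llama-3.1-8B": [77.27, 57.94, 65.28, 81.80, 48.60, 45.06, 30.05, 42.27, 24.76],
    "Gemma-7B": [78, 61, 64, 82, 50, 45, 17, 39, 32],
    "Mistral-NeMo-Minitron-8B": [80.35, 64.42, 69.51, 83.03, 58.45, 47.56, 31.94, 43.77, 36.22],
}

# Training tokens as listed in Table 2 (T = 1e12, B = 1e9).
TRAINING_TOKENS = {
    "Llama-3.1-8B": 15e12,
    "Gemma-7B": 6e12,
    "Mistral-NeMo-Minitron-8B": 380e9,
}


def leaders(scores, benchmarks):
    """Return the highest-scoring model for each benchmark."""
    return {
        name: max(scores, key=lambda model: scores[model][i])
        for i, name in enumerate(benchmarks)
    }


def token_ratio(tokens, model, baseline):
    """How many times more tokens the baseline lists than the model."""
    return tokens[baseline] / tokens[model]


best = leaders(SCORES, BENCHMARKS)
wins = sum(1 for model in best.values() if model == "Mistral-NeMo-Minitron-8B")
ratio_vs_llama = token_ratio(TRAINING_TOKENS, "Mistral-NeMo-Minitron-8B", "Llama-3.1-8B")

print(f"Benchmarks led by Mistral-NeMo-Minitron-8B: {wins} of {len(BENCHMARKS)}")
print(f"Listed training tokens, Llama-3.1-8B vs. Mistral-NeMo-Minitron-8B: {ratio_vs_llama:.1f}x")
```

With the values as listed, Mistral-NeMo-Minitron-8B leads all nine columns among these three models, while its listed token count is roughly 40x smaller than the 15T listed for Llama-3.1-8B.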

Overview of model pruning and distillation

Mistral-NeMo-Minitron 8B

Teacher fine-tuning

Width-only pruning

Distillation parameters

Mistral-NeMo-Minitron-8B-Instruct

Performance benchmarks

Conclusion

Acknowledgments
