NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference


The demand for ready-to-deploy, high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice containers for AI model inference, with enterprise-grade generative AI performance that improves with every release. With NIM version 1.4, scheduled for release in early December, request performance improves by up to 2.4x out of the box with the same single-command deployment experience.

At the core of NIM are multiple LLM inference engines, including NVIDIA TensorRT-LLM, which enable NIM to achieve speed-of-light inference performance. With each release, NIM incorporates the latest advancements in kernel optimizations, memory management, and scheduling from these engines to improve performance.

Figure 1. NVIDIA NIM 1.4 throughput compared to NIM 1.2. Llama 3.1 70B on 2x H200-SXM, input tokens 8K, output tokens 256; Llama 3.1 8B on 1x H100-SXM, input tokens 30K, output tokens 256

NIM 1.4 adds significant improvements in kernel efficiency, runtime heuristics, and memory allocation, translating into up to 2.4x faster inference compared with NIM 1.2. These advancements are crucial for businesses that rely on quick responses and high throughput in generative AI applications.
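As a rough illustration of how request-level performance like this can be observed on your own deployment, the sketch below times concurrent chat completion requests against a locally running NIM endpoint. The URL, port, concurrency settings, and model name are assumptions for illustration only, not part of this release; adjust them to match your deployment.

```python
# Minimal latency probe for a locally deployed NIM endpoint (illustrative only).
# Assumes a NIM microservice is already running and serving its OpenAI-compatible
# API at http://localhost:8000; the model name below is an example and should be
# replaced with the model reported by your container.
import concurrent.futures
import statistics
import time

import requests

NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed default port
MODEL = "meta/llama-3.1-8b-instruct"                    # example model name

def timed_request(prompt: str) -> float:
    """Send one chat completion request and return its end-to-end latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    start = time.perf_counter()
    resp = requests.post(NIM_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start

if __name__ == "__main__":
    prompts = ["Summarize the benefits of optimized inference."] * 16
    # Issue requests concurrently to approximate a fixed offered load.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        latencies = list(pool.map(timed_request, prompts))
    print(f"p50 latency: {statistics.median(latencies):.2f}s")
    print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.2f}s")
```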

NIM also benefits from continuous updates to full-stack accelerated computing, which enhances performance and efficiency at every level of the computing stack. This includes support for the latest NVIDIA TensorRT and NVIDIA CUDA versions, further boosting inference performance. NIM users benefit from these continuous improvements without manually updating software.

Figure 2. Request latency versus requests per second for Llama 3.1 8B NIM 1.4 versus Llama 3.1 8B NIM 1.2 running on 1x H100-SXM, input tokens 30K, output tokens 256, showing 2x faster request latency for NIM 1.4 compared with NIM 1.2

NIM brings together a full suite of preconfigured software to deliver high-performance AI inference with minimal setup, so developers can get started quickly.
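As a minimal getting-started sketch, the example below sends a request to a running NIM microservice through its OpenAI-compatible API using the openai Python client. The base URL, API key placeholder, and model name are assumptions for illustration; check the models your container exposes and adjust accordingly.

```python
# Illustrative client call against a locally running NIM microservice, assuming
# the container serves its OpenAI-compatible API at http://localhost:8000/v1.
from openai import OpenAI

# A local NIM deployment typically does not require a real API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example model name; confirm on your deployment
    messages=[{"role": "user", "content": "Write a one-sentence summary of NVIDIA NIM."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```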

A continuous innovation loop means that every improvement in TensorRT-LLM, CUDA, and other core accelerated computing technologies immediately benefits NIM users. Improvements are seamlessly integrated and delivered through updated NIM microservice containers, eliminating the need for manual configuration and reducing the engineering overhead typically associated with maintaining high-performance inference solutions.

Get started today

NVIDIA NIM is the fastest path to high-performance generative AI without the complexity of traditional model deployment and management. With enterprise-grade reliability and support plus continuous performance enhancements, NIM makes high-performance AI inference accessible to enterprises. Learn more and get started today.
