NVIDIA NIM 1.4 Ready to Deploy with 2.4x Faster Inference

The demand for ready-to-deploy, high-performance inference is growing as generative AI reshapes industries. NVIDIA NIM provides production-ready microservice containers for AI model inference, with continuous improvements to enterprise-grade generative AI performance. The upcoming NIM version 1.4, scheduled for release in early December, improves request performance by up to 2.4x out of the box while keeping the same single-command deployment experience.

At the core of NIM are multiple LLM inference engines, including NVIDIA TensorRT-LLM, which enable it to achieve speed-of-light inference performance. With each release, NIM incorporates the latest advances in kernel optimization, memory management, and scheduling from these engines to improve performance.

Figure 1. NVIDIA NIM 1.4 throughput compared to NIM 1.2. Llama 3.1 70B on 2x H200-SXM, input tokens 8K, output tokens 256; Llama 3.1 8B on 1x H100-SXM, input tokens 30K, output tokens 256

NIM 1.4 adds significant improvements in kernel efficiency, runtime heuristics, and memory allocation, translating into up to 2.4x faster inference compared to NIM 1.2. These advances are crucial for businesses that rely on quick responses and high throughput in generative AI applications.

NIM also benefits from continuous updates to full-stack accelerated computing, which enhances performance and efficiency at every level of the computing stack. This includes support for the latest NVIDIA TensorRT and NVIDIA CUDA versions, further boosting inference performance. NIM users benefit from these continuous improvements without manually updating software.

Figure 2. Request latency in seconds across requests per second for Llama 3.1 8B NIM 1.4 versus Llama 3.1 8B NIM 1.2 on 1x H100-SXM, input tokens 30K, output tokens 256, showing about 2x lower request latency for NIM 1.4
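To see how numbers like these translate to your own workload, one simple approach is to time requests against a running NIM endpoint through its OpenAI-compatible API. The sketch below is a minimal illustration rather than a rigorous benchmark; the endpoint URL and model name are assumptions that depend on how the microservice was deployed.

```python
# Minimal latency check against a running NIM microservice, timing
# non-streaming requests to its OpenAI-compatible chat completions API.
# Assumptions (adjust for your deployment): the service listens on
# localhost:8000 and serves a model named "meta/llama-3.1-8b-instruct".
import statistics
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed default port
MODEL = "meta/llama-3.1-8b-instruct"                # assumed model name

def timed_request(prompt: str, max_tokens: int = 256) -> float:
    """Send one chat completion request and return wall-clock latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start

latencies = [timed_request("Summarize the benefits of optimized inference.") for _ in range(10)]
print(f"median latency: {statistics.median(latencies):.2f} s "
      f"(min {min(latencies):.2f} s, max {max(latencies):.2f} s)")
```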

NIM brings together a full suite of preconfigured software to deliver high-performance AI inference with minimal setup, so developers can get started quickly.
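For example, once a NIM microservice is up, it can be queried through its OpenAI-compatible API with standard client libraries. A minimal sketch, assuming a local deployment on port 8000 and the meta/llama-3.1-8b-instruct model name (both depend on your setup):

```python
# Quick smoke test of a deployed NIM microservice via its OpenAI-compatible API.
# The base_url and model name below are assumptions; substitute the values
# for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",        # assumed local NIM endpoint
    api_key="not-needed-for-local-deployment",  # placeholder; adjust if your deployment requires a key
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",         # assumed model name
    messages=[{"role": "user", "content": "Write one sentence about fast inference."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```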

A continuous innovation loop means that every improvement in TensorRT-LLM, CUDA, and other core accelerated computing technologies immediately benefits NIM users. These improvements are integrated and delivered through new NIM microservice container releases, eliminating manual configuration and reducing the engineering overhead typically associated with maintaining high-performance inference solutions.

Get started today

NVIDIA NIM is the fastest path to high-performance generative AI without the complexity of traditional model deployment and management. With enterprise-grade reliability and support, plus continuous performance enhancements, NIM makes high-performance AI inference accessible to enterprises. Learn more and get started today.
