Improved Performance and Monitoring Capabilities with NCCL 2.26

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode communication primitives optimized for NVIDIA GPUs and networking. NCCL is a central piece of software for multi-GPU deep learning training. It handles any kind of inter-GPU communication, be it over PCI, NVIDIA NVLink, or networking. It uses advanced topology detection, optimized communication graphs, and tuning models to get the best performance straight out of the box on NVIDIA GPU platforms.

In this post, we discuss the new features and fixes released in NCCL 2.26. For more details, visit the NVIDIA/nccl GitHub repo. Note that the NCCL 2.25 release was solely focused on NVIDIA Blackwell platform support, with no library feature changes. For this reason, no release post has been published for that version. 

Release highlights 

NVIDIA Magnum IO NCCL is a library designed to optimize inter-GPU and multinode communication, crucial for efficient parallel computing in AI and HPC applications. The latest release of NCCL brings significant improvements to performance, monitoring, reliability, and quality of service. These improvements are supported through the following features: 

  • PAT optimizations: CUDA warp-level optimizations improving PAT algorithm parallelism.
  • Implicit launch order: Prevent deadlocks when a device is used simultaneously by multiple communicators and detect host threads racing to launch.
  • GPU kernel and network profiler support: New events allowing for more comprehensive characterization of NCCL performance at the kernel and network plugin level.
  • Network plugin QoS support: Communicator-level quality of service (QoS) control, allowing users to partition available network resources across communicators.
  • RAS improvements: Enriched information on collective operations, improved diagnostic output, and enhanced stability.

NCCL 2.26 features 

This section dives deeper into the details of each of the following new features: 

  • PAT optimizations 
  • Implicit launch order 
  • GPU kernel profiler support 
  • Network plugin profiler support 
  • Network plugin QoS support 
  • RAS improvements 

PAT optimizations 

This change optimizes the PAT algorithm, introduced in NCCL 2.23, by separating the computation and execution of PAT steps on different warps. 

Unlike the initial implementation, where each thread would compute the PAT steps and execute them one at a time, a thread is now dedicated to computing PAT steps, and up to 16 warps can execute 16 steps in parallel coming from 16 different parallel trees. 

This optimization helps in cases with many parallel trees: that is, at least 32 ranks and small operation sizes, where the linear part of the algorithm starts to limit performance, typically at large scale. 
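
The optimization is transparent to applications. As a hedged sketch for benchmarking the difference on small operations at scale, the following assumes that PAT can be selected through NCCL_ALGO, as with other NCCL algorithms; the function name is illustrative.

#include <stdlib.h>

// Hedged sketch: pin NCCL to the PAT algorithm when benchmarking small
// AllGather/ReduceScatter operations at scale (for example, to compare the
// 2.26 optimization against 2.25). The variable must be set before the first
// communicator is created in the process.
void force_pat_for_benchmarking(void) {
  setenv("NCCL_ALGO", "PAT", 1);  // override NCCL's automatic algorithm selection
  // ... then create communicators and run ncclAllGather / ncclReduceScatter as usual ...
}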

Implicit launch order 

When using multiple NCCL communicators per device, NCCL requires the user to serialize the order of all communication operations (through CUDA stream dependencies or synchronization) into a consistent total global order. Otherwise, deadlocks could ensue if the device-side communication kernels get ordered differently on different ranks. 

NCCL 2.26 optionally relaxes this requirement by adding support for implicit launch order, controlled by NCCL_LAUNCH_ORDER_IMPLICIT. If enabled, NCCL adds dependencies between the launched communication kernels to ensure that the device-side order matches the launch order on the host. This makes NCCL easier to use correctly but, because it increases latency, it is off by default. Users facing difficult-to-debug hangs are encouraged to try it out. 

Even if enabled, users must still ensure that the order of host-side launches matches for all devices. This is most easily accomplished by launching in a deterministic order from a single host thread per device. A complementary mechanism—controlled by NCCL_LAUNCH_RACE_FATAL, which is on by default—attempts to detect races between multiple host threads launching to the same device. If it triggers, the operation is aborted, and an error is returned. 
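
To make this concrete, here is a minimal sketch of the two-communicators-per-device pattern that implicit launch order targets. Communicator and stream setup, buffer allocation, and error handling are omitted, and the function and variable names are illustrative; NCCL_LAUNCH_ORDER_IMPLICIT=1 must be set in the environment before the communicators are created.

#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative sketch: two communicators on the same device, each with its
// own stream, launched from a single host thread in the same order on every
// rank. With NCCL_LAUNCH_ORDER_IMPLICIT=1, NCCL adds inter-kernel
// dependencies so the device executes the two allreduces in the order they
// were launched on the host, without explicit stream dependencies.
void launch_on_two_comms(ncclComm_t commA, ncclComm_t commB,
                         float* bufA, float* bufB, size_t count,
                         cudaStream_t streamA, cudaStream_t streamB) {
  // Host-side launch order is A then B; implicit launch order makes the
  // device-side order follow it.
  ncclAllReduce(bufA, bufA, count, ncclFloat, ncclSum, commA, streamA);
  ncclAllReduce(bufB, bufB, count, ncclFloat, ncclSum, commB, streamB);
}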

GPU kernel and network profiler support 

The NCCL profiler plugin interface, introduced in 2.23, now supports GPU kernel and network defined events.  

Figure 1. NCCL stack including GPU kernel and network monitoring. An NCCL API call is handled by the CPU and then transferred to the GPU for processing, while communication proceeds over the network in parallel.

GPU kernel events 

Prior to the 2.26 release, the NCCL profiler plugin interface was limited to network proxy events. However, these events only capture part of the NCCL behavior and ignore the GPU. To address this limitation and provide more accurate information to profiler users, the NCCL core in 2.26 has been extended with new kernel profiling infrastructure that allows the host code to monitor the progress of individual collective and point-to-point operations in both standalone and fused kernels (that is, grouped operations). 

Fused kernels generate one event per operation for every channel, allowing the profiler to directly correlate them to the originating NCCL operation. 

Whenever the NCCL kernel begins or ends execution of a new operation, it communicates this to the host. Correspondingly, the host code in NCCL invokes the profiler startEvent or stopEvent callback.

The profiler plugin interface has been extended with a new ncclProfileKernelCh event that captures the kernel activity observed by the NCCL host code. This is parented by either the ncclProfileColl or ncclProfileP2p event. 

It is important to note that the accuracy of the kernel events reported by NCCL is limited by the current design, which leverages the proxy thread, also used to progress network operations, to monitor the GPU kernel activity. This limitation will be addressed in future NCCL releases. 
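
For profiler plugin authors, a hedged sketch of handling the new event type is shown below. The descriptor field names (eDescr->type, eDescr->parentObj, eDescr->kernelCh.channelId) follow the example profiler shipped in the NCCL repository and should be verified against the 2.26 profiler header; the helper and file names are illustrative.

#include <stdlib.h>
#include <time.h>
#include "profiler.h"   // profiler plugin header, as in NCCL's ext-profiler example (assumed name)

// Per-event record kept by this example profiler for kernel channel events.
typedef struct {
  double startUs;    // host timestamp when the kernel channel became active
  void* parent;      // handle of the parent ncclProfileColl/ncclProfileP2p event
  int channelId;     // channel on which the kernel reported activity
} kernelChEvent_t;

static double nowUs(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
}

// startEvent callback: record only ncclProfileKernelCh events here.
ncclResult_t myProfilerStartEvent(void* context, void** eHandle,
                                  ncclProfilerEventDescr_t* eDescr) {
  *eHandle = NULL;
  if (eDescr->type != ncclProfileKernelCh) return ncclSuccess;
  kernelChEvent_t* ev = (kernelChEvent_t*)calloc(1, sizeof(*ev));
  ev->startUs = nowUs();
  ev->parent = eDescr->parentObj;               // ncclProfileColl or ncclProfileP2p event
  ev->channelId = eDescr->kernelCh.channelId;   // field name per the example profiler
  *eHandle = ev;
  return ncclSuccess;
}

// stopEvent callback: the duration approximates kernel activity as observed
// by the NCCL host code (see the accuracy caveat above).
ncclResult_t myProfilerStopEvent(void* eHandle) {
  if (eHandle) free(eHandle);   // a real profiler would emit or aggregate the event here
  return ncclSuccess;
}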

Network defined events 

The NCCL profiler plugin interface, introduced in the 2.23 release, was limited to NCCL core events. However, users are often interested in the events produced by the network. To address this limitation, the profiler plugin interface in 2.26 has been extended with support for network defined events. Network plugin developers can now define plugin-specific events and propagate them to the profiler plugin. 

This is made possible through the following extensions in NCCL: 

  • New ncclProfileNetPlugin event: Wrapper event that NCCL uses when invoking the profiler on behalf of the network plugin.
  • V10 network plugin interface extensions: Plumb these new events to all relevant initialization and data operation functions.
ncclResult_t (*init)(ncclDebugLogger_t logFunction,
                     ncclProfilerCallback_t profFunction)

ncclResult_t (*isend)(void* sendComm, void* data,
                      size_t size, int tag, void* mhandle,
                      void* phandle, void** request)

ncclResult_t (*irecv)(void* recvComm, int n, void** data,
                      size_t* sizes, int* tags,
                      void** mhandles, void** phandles,
                      void** request)
  • NCCL core profiler callback: Used by the network plugin to invoke the profiler through NCCL; a usage sketch follows the callback signature below.
    • eHandle: Profiler plugin event handle (return value during event start, input value during event stop) 
    • type: Start (0) or stop (1) a network defined event 
    • pHandle: Pointer to NCCL core internal object, supplied by NCCL itself during isend/irecv calls and used by NCCL to link the ncclProfileNetPlugin event to its parent during the callback 
    • pluginId: Unique identifier used by the network and the profiler plugins to sync on the network type and event version (if the profiler plugin does not recognize the supplied pluginId it should ignore the event) 
    • extData: Network defined event, supplied to the profiler callback and propagated by NCCL to the profiler plugin as part of a ncclProfileNetPlugin event 
ncclResult_t (*ncclProfilerCallback_t)(void** eHandle,
                                       int type,
                                       void* pHandle,
                                       int64_t pluginId,
                                       void* extData)

The updated profiler event hierarchy accounting for kernel and network events is as follows: 

ncclProfileGroup
|
+- ncclProfileColl/ncclProfileP2p
|  |
|  +- ncclProfileProxyOp
|  |  |
|  |  +- ncclProfileProxyStep
|  |     |
|  |     +- ncclProfileNetPlugin
|  |
|  +- ncclProfileKernelCh

ncclProfileProxyCtrl

Network plugin QoS support 

QoS in network communication is critical to ensure good performance in HPC workloads. For example, during LLM training, several types of network communications overlap, such as pipeline parallelism (PP) and data parallelism (DP) communications. PP communications typically reside on the critical path, while DP communications do not. When these communications overlap, both experience slowdowns as they contend for shared network resources. By prioritizing communication on the critical path, the end-to-end performance of the application is significantly improved. 

To enable QoS, NCCL provides a per-communicator configuration option using the ncclCommInitRankConfig API. NCCL now supports a new trafficClass field in the existing ncclConfig_t structure. The trafficClass is an integer that serves as an abstract representation of the QoS level for communicator network traffic. Applications can set this field using either a default value or a user-defined setting. The specific meaning of the trafficClass is determined by the network system administrator and the network stack implementor. 

typedef struct {
  ...
  int splitShare;
  int trafficClass; // add trafficClass to communicator
} ncclConfig_t;

NCCL passes the trafficClass to the network plugin without modifying it.
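
For example, an application could give its pipeline-parallel communicator a dedicated traffic class at creation time. In the sketch below, the value 105 and the function name are illustrative; ncclConfig_t, NCCL_CONFIG_INITIALIZER, and ncclCommInitRankConfig are existing NCCL APIs.

#include <nccl.h>

// Request a specific traffic class for a latency-critical communicator.
// The numeric value is defined by the fabric administrator and interpreted
// by the network plugin; NCCL only forwards it.
ncclResult_t initPpComm(ncclComm_t* ppComm, int nRanks, int myRank,
                        ncclUniqueId commId) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.trafficClass = 105;   // example value for high-priority PP traffic
  return ncclCommInitRankConfig(ppComm, nRanks, commId, myRank, &config);
}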

The network plugin interface is also extended with a new ncclNetCommConfig_t structure, which encompasses the trafficClass and is passed to each connection in the network plugin. For instance, the connect function adds an input argument ncclNetCommConfig_t* config. 

typedef struct {
  int trafficClass; // Plugin-specific trafficClass value
} ncclNetCommConfig_t;

ncclResult_t (*connect)(int dev, ncclNetCommConfig_t* config, void* handle,
                        void** sendComm, ncclNetDeviceHandle_v10_t** sendDevComm);

Network plugin implementors can use the trafficClass value to set specific fields in their network packets to enable QoS. The extended plugin interface is designed to support both internal and external network plugins. For example, the value can be passed to the NCCL internal net_ib plugin, where it can be used to configure the qpAttr.ah_attr.grh.traffic_class field if the trafficClass is defined. 
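
As a rough sketch of the plugin side: the struct ibv_qp_attr and ah_attr.grh.traffic_class names come from libibverbs, while the helper function and the "is it set" check are illustrative and should be matched to how the plugin actually stores its per-connection configuration.

#include <stdint.h>
#include <infiniband/verbs.h>

// Apply the traffic class received through ncclNetCommConfig_t when setting
// up a queue pair. On RoCE fabrics the GRH traffic class carries the
// DSCP/ECN bits that switches use for QoS.
static void applyTrafficClass(struct ibv_qp_attr* qpAttr,
                              const ncclNetCommConfig_t* config) {
  if (config && config->trafficClass >= 0) {   // illustrative "is it set" check
    qpAttr->ah_attr.grh.traffic_class = (uint8_t)config->trafficClass;
  }
}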

RAS improvements 

The bulk of changes to the RAS subsystem for this release were internal enhancements to improve its stability and memory footprint at scale. User-visible improvements include: 

  • Report mismatches in collective operation counts separately for each operation type. This may help identify nondeterministic launch order scenarios. 
  • Provide details about communicator ranks that failed to report in. Prior to 2.26, only the rank number was provided in such cases, and RAS could not reliably determine if the lack of information from a rank was due to that rank being unreachable or simply because the rank was no longer a given communicator’s member. These two cases can now be distinguished, with the former reported as an INCOMPLETE error, and the latter as a MISMATCH warning with a new NOCOMM rank status (replacing the previously used UNKNOWN). 

RAS-specific fixes include: 

  • Avoid counting graph-captured collectives. Depending on the collective algorithm being used, graph-captured collective operations could cause the collective operation counts to go out of sync between ranks, resulting in false-positive warnings from RAS. The counting of graph-captured collectives has been disabled for now. 
  • Clean up RAS resources prior to termination. This not only avoids additional noise from third-party leak-detection tools, but also prevents a crash if the libnccl.so library is dynamically unloaded without terminating the process, where the still-running RAS thread could try to execute code that had been unmapped. 

Bug fixes and minor features 

NCCL 2.26 provides the following additional updates: 

  • Add Direct NIC support (GPUDirect RDMA for NICs attached to a CPU when the GPU is connected to that CPU through C2C), controlled through NCCL_NET_GDR_C2C 
  • Add timestamps to NCCL diagnostic messages (on by default for WARN messages; configurable through NCCL_DEBUG_TIMESTAMP_LEVELS and NCCL_DEBUG_TIMESTAMP_FORMAT) 
  • Reduce the memory usage with NVLink SHARP 
  • Incorporate performance data for Intel Emerald Rapids and Sapphire Rapids CPUs into the NCCL performance model 
  • Add support for comment lines (starting with #) in the nccl.conf file 
  • Fixed a race condition with split-shared communicators during connection establishment 
  • Fixed a performance regression where NCCL_CROSS_NIC=1 would still alternate rings, breaking the rings' GPU-NIC associations 
  • Improved IB/RoCE GID detection in container environments 
  • Fixed a race condition with non-blocking communication, where back-to-back collective operations on different communicators could result in a crash 
  • Fixed a bug on B203 where NCCL requested more shared memory than the device supports 
  • Fixed a hang caused by the progress thread spinning on network progress when users abort a communicator 
  • Fixed a hang when the network plugin’s test call returns an error 
  • Fixed a hang when mixing different architectures, where ranks could end up with different communication patterns 
  • Fixed a double-free crash on failed ncclCommInitRank and ncclCommFinalize 
  • Fixed a bug where under rare circumstances some variables specified in the config file could be ignored 

Summary 

NCCL 2.26 introduces several important new features and improvements, including PAT optimizations, implicit launch ordering, GPU kernel profiler support, network plugin profiler support, network plugin QoS support, and RAS improvements.

To learn more about previous NCCL releases, see the earlier posts in this series.

Learn more about NCCL and NVIDIA Magnum IO, and check out the GTC 2025 on-demand session, NCCL: The Inter-GPU Communication Library Powering Multi-GPU AI.