Accelerating Leaderboard-Topping ASR Models 10x with NVIDIA NeMo

NVIDIA NeMo has consistently developed automatic speech recognition (ASR) models that set the benchmark in the industry, particularly those topping the Hugging Face Open ASR Leaderboard.

These NVIDIA NeMo ASR models that transcribe speech into text offer a range of architectures designed to optimize both speed and accuracy:

CTC model (nvidia/parakeet-ctc-1.1b): This model features a FastConformer encoder and a softmax prediction head. It’s non-autoregressive, meaning future predictions do not depend on the previous ones, enabling fast and efficient inference.
RNN-T model (nvidia/parakeet-rnnt-1.1b): This transducer model adds a prediction and joint network to the FastConformer encoder, making it autoregressive—each prediction depends on the previous prediction history. Due to this property, there is a common misconception that RNN-T models are slow for GPU inference and better suited to CPUs.

TDT model (nvidia/parakeet-tdt-1.1b): Another transducer model, but trained with a refined transducer objective called token-and-duration transducer (TDT). While still autoregressive, it can perform multiple predictions at each step, making it faster at inference.
TDT-CTC model (parakeet-tdt_ctc-110m): This is a hybrid variant of transducer and CTC decoders, bringing both decoders for faster convergence during training. It enables training only one model for two decoders.
AED model (nvidia/canary-1b): Attention-encoder-decoder (AED) model, also based on the FastConformer, is autoregressive and offers the highest accuracy (lowest word error rate, or WER) at the cost of additional computation.

Previously, these models faced speed performance bottlenecks such as casting overheads, low compute intensity, and divergence performance issues.

In this post, you’ll discover how NVIDIA boosted the inference speed of NeMo ASR models by up to 10x (Figure 1) through key enhancements like autocasting tensors to bfloat16, the innovative label-looping algorithm, and the introduction of CUDA Graphs available with NeMo 2.0.0.

Accelerating Leaderboard-Topping ASR Models 10x with NVIDIA NeMo — *Figure 1. NVIDIA NeMo ASR models achieve up to 10x speed improvement in inverse real-time factor (RTFx) through recent optimizations*

CPU/GPU	AWS instance	Hourly cost	# of vCPU/GPU	Streams per instance*	RTFx (single unit)	Total RTFx	Cost of 1M hr transcription
AMD Epyc 4th Gen	C7a.48xlarge	$9.85	192	192	4.5	864	$11,410
NVIDIA A100 80GB	P4de.24xlarge	$40.97	8	512	2053	16425	$2,499

Accelerating Leaderboard-Topping ASR Models 10x with NVIDIA NeMo

Overcoming speed performance bottlenecks

Casting overheads from automatic mixed precision

Resolving AMP overheads with full half-precision inference

Optimizing batch processing for enhanced performance

Low compute intensity in RNN-T and TDT prediction networks

Eliminating low compute intensity with dynamic control flow in CUDA Graphs conditional nodes

Divergence in RNN-T and TDT prediction networks

Solving divergence with efficient new decoding algorithm

Performance enhancements up to 10x faster and up to 4.5x more cost-effective

Accelerate your transcriptions with NVIDIA ASR

Perplexity lets you search your internal enterprise files and the web

Physical Edition Release Coming Soon!

Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy

A Growing Threat for SMBs

Ransomware Attacks Growing More Dangerous

Related articles

How marketers should advertise to mobile gamers during the holidays | Unity

Is TSMC Secretly Supplying Chips to Huawei? U.S. Launches Probe

218 Layers With Superior Scaling

Supermicro Launches NVIDIA BlueField-Powered JBOF to Optimize AI Storage