As LLMs become bigger and smarter over time, they chew through way more compute. It’s not just about size either, because today’s models like to “think out loud”, producing long runs of intermediate reasoning tokens before giving you an answer. Put those two trends together, and the demand for raw inference performance skyrockets.

NVIDIA Blackwell Ultra MLPerf Inference Benchmark

NVIDIA’s Blackwell Ultra architecture, which powers the flagship GB300 NVL72 system, looks more than capable of keeping up with those demands, judging by the latest MLPerf Inference v5.1 results. This round of benchmarks added some new heavyweights: DeepSeek-R1 with its massive 671B-parameter MoE architecture, Llama 3.1 in both 405B and 8B variants, and Whisper, which replaced RNN-T after blowing up on Hugging Face with nearly 5 million downloads in a month.

NVIDIA not only submitted results with its Blackwell GPUs but also debuted the new Blackwell Ultra architecture, and it smashed records across the board.

NVIDIA Blackwell Ultra MLPerf Inference Benchmark Record

On DeepSeek-R1, the GB300 NVL72 system delivered up to 45% higher performance per GPU compared to its own GB200 predecessor, and about 5x the throughput of Hopper-based systems. That kind of jump translates to much lower cost per token and higher AI factory output. A lot of it came down to clever optimizations: squeezing weights into NVFP4, a four-bit floating-point format, for higher throughput; converting KV caches to FP8 to shrink memory usage; and introducing new parallelism strategies that keep every GPU busy without creating bottlenecks.
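NVIDIA hasn’t published the exact recipe in this post, but the general shape of block-scaled 4-bit quantization is easy to illustrate. Below is a minimal numpy sketch, assuming a toy E2M1-style value grid and float32 per-block scales standing in for the FP8 scales the real NVFP4 format uses; the actual conversion happens inside TensorRT-LLM and the GPU, not in Python.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(weights: np.ndarray, block: int = 16):
    """Round a 1-D weight vector to a 4-bit grid with one scale per block.

    NVFP4 pairs FP4 (E2M1) elements with a scale per 16-element block;
    this toy version keeps the scale in float32 for clarity.
    """
    w = weights.reshape(-1, block)
    # One scale per block so the largest element maps to the grid maximum.
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    # Snap each scaled magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(w / scales)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(w) * FP4_GRID[idx]
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
q, s = quantize_nvfp4_like(w)
print(f"mean abs quantization error: {np.abs(dequantize(q, s) - w).mean():.4f}")
```

The per-block scale is the whole trick: outliers only hurt the 16 weights sharing their block instead of the entire tensor, which is what makes four bits survivable for big models.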

The Llama 3.1 results pushed things further too, especially on the new 405B interactive benchmark, where the time-to-first-token and tokens-per-user requirements are even tighter. To hit those numbers, NVIDIA leaned on techniques like disaggregated serving, which splits prefill and decode across separate GPU pools, and NVLink-powered all-to-all GPU communication, unlocking nearly 1.5x better throughput per GPU compared to older setups.
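Disaggregated serving is simpler than it sounds: prefill is compute-bound and decode is memory-bandwidth-bound, so they get separate GPU pools with the KV cache handed off between them. The toy Python sketch below shows just the routing idea; the Request and Pool names are made up for illustration, and in the real system the handoff moves actual KV tensors over NVLink rather than a placeholder string.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    kv_cache: object = None            # handed off after prefill

@dataclass
class Pool:
    name: str
    queue: deque = field(default_factory=deque)

# Two separate GPU pools that can be scaled independently.
prefill_pool = Pool("prefill")
decode_pool = Pool("decode")

def admit(req: Request):
    prefill_pool.queue.append(req)     # every request starts with prefill

def prefill_step():
    if prefill_pool.queue:
        req = prefill_pool.queue.popleft()
        req.kv_cache = f"kv[{req.prompt_tokens} tokens]"  # stand-in for real KV tensors
        decode_pool.queue.append(req)  # hand the populated cache to a decode GPU

def decode_step():
    if decode_pool.queue:
        req = decode_pool.queue[0]
        req.max_new_tokens -= 1        # generate one token per step
        if req.max_new_tokens == 0:
            decode_pool.queue.popleft()

admit(Request(prompt_tokens=512, max_new_tokens=3))
prefill_step()
for _ in range(3):
    decode_step()
print("decode queue drained:", not decode_pool.queue)
```

Because the two pools scale independently, a deployment can throw extra GPUs at whichever phase is the bottleneck, which is where the interactive benchmark’s tight time-to-first-token target gets met.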

When you put it all together, Blackwell Ultra is showing not just incremental gains but architectural leaps: higher memory capacity, stronger attention compute, and smarter software stacks, with TensorRT-LLM and CUDA Graphs making every cycle count.
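CUDA Graphs in particular are worth a concrete look: they record a fixed sequence of kernel launches once, then replay it with near-zero CPU launch overhead, which matters a lot when each decode step is tiny. TensorRT-LLM applies this internally; the sketch below shows the same idea through PyTorch’s public CUDA Graphs API, assuming a CUDA GPU is available and using a plain Linear layer as a stand-in for a real decode step.

```python
import torch

# Requires a CUDA GPU. The model here is just a stand-in workload.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.randn(1, 4096, device="cuda")

# Warm up before capture so lazy allocations don't end up in the graph.
with torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.synchronize()

# Capture one forward pass as a replayable graph of kernel launches.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_out = model(static_in)      # recorded, not executed immediately

# Per step: copy fresh data into the static input buffer, then replay.
static_in.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_out.shape)
```

The catch is that inputs and outputs must live in fixed buffers, which is why inference engines pad to static shapes before turning CUDA Graphs on.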

Click here to check out more numbers if you’re interested.
