NVIDIA’s H100 GPU is now up to twice as fast at LLM inference thanks to the latest software update

NVIDIA has announced TensorRT-LLM, new software that pushes its top-of-the-line H100 GPUs to as much as double their previous LLM inference throughput.

The optimization effort is the result of close collaboration with leading companies that rely heavily on AI in their daily operations, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, and MosaicML, all with the single goal of accelerating LLM inference purely through software.

The resulting package is the open-source TensorRT-LLM software, usable on Ampere, Lovelace, and Hopper GPUs. It comprises the TensorRT deep learning compiler, pre- and post-processing steps, optimized kernels, and multi-GPU/multi-node communication, and it removes the need for highly technical C++ or NVIDIA CUDA knowledge.

According to NVIDIA’s official figures, the software update alone delivers a further performance gain on the H100, reaching up to twice the inference throughput on GPT-J 6B and about 1.77x on Meta’s Llama 2.

Higher throughput, especially when it grows by a multiple like this, also lowers the Total Cost of Ownership (TCO) and energy consumption for a given workload. That gives data center owners more financial flexibility: they can serve the same load with less hardware at a lower cost, or serve more load at the same baseline cost.
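To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. The target load and the per-GPU baseline throughput are made-up placeholder numbers, not NVIDIA figures; only the 2x and 1.77x speedups come from the article. The function name `gpus_needed` is likewise just for illustration.

```python
import math

def gpus_needed(target_throughput, baseline_per_gpu, speedup=1.0):
    """Return how many GPUs are needed to sustain a target throughput.

    target_throughput and baseline_per_gpu use the same unit
    (for example, tokens per second). speedup is the software gain.
    """
    effective_per_gpu = baseline_per_gpu * speedup
    return math.ceil(target_throughput / effective_per_gpu)

# Hypothetical workload: 1000 units of load, 100 units per GPU before the update.
before = gpus_needed(1000, 100)            # no software speedup
after_gptj = gpus_needed(1000, 100, 2.0)   # "up to 2x" case from the article
after_llama = gpus_needed(1000, 100, 1.77) # "about 1.77x" case from the article

print(before, after_gptj, after_llama)
```

Under these assumed numbers the same load fits on roughly half the GPUs, which is where the lower hardware and energy cost would come from. Real deployments also depend on latency targets, memory limits, and utilization, which this sketch ignores.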

On the technical side, NVIDIA credits two techniques. Tensor parallelism splits individual weight matrices across devices for efficient inference at scale. In-flight batching applies a similar idea to requests: instead of waiting for an entire batch to finish, completed requests leave the batch immediately and new requests take their place.
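Both ideas can be illustrated with a toy model in plain Python. This is a simplified sketch and not the TensorRT-LLM API: the function names are invented for illustration, the "devices" are just list slices, and each request is reduced to the number of generation steps it needs.

```python
from collections import deque

def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def tensor_parallel_matmul(x, w, num_devices):
    """Split w column-wise into shards, compute each shard, then concatenate."""
    n = len(w[0])
    shard = (n + num_devices - 1) // num_devices  # columns per device (ceiling)
    out = []
    for d in range(num_devices):
        cols = range(d * shard, min((d + 1) * shard, n))
        w_shard = [[row[j] for j in cols] for row in w]
        out.extend(matmul(x, w_shard))  # stands in for gathering device results
    return out

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def inflight_batching_steps(lengths, batch_size):
    """A finished request frees its slot at once for the next queued request."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())  # fill free slots before each step
        steps += 1
        active = [r - 1 for r in active if r > 1]  # drop requests that just finished
    return steps

x = [1, 2]
w = [[1, 2, 3], [4, 5, 6]]
print(tensor_parallel_matmul(x, w, 2) == matmul(x, w))

lengths = [8, 2, 2, 2]  # steps needed per request (made-up values)
print(static_batching_steps(lengths, 2), inflight_batching_steps(lengths, 2))
```

In the toy workload, one long request holds up the static batch while short requests sit idle, so in-flight batching finishes in fewer total steps. The sharded multiplication returns the same result as the unsharded one, which is the property tensor parallelism relies on.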

Another key to the H100’s performance is the ability to convert model weights to the new FP8 format, made possible by the Hopper Transformer Engine, which enables fast quantization with reduced memory consumption.
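The memory side of this is simple arithmetic, sketched below. It assumes a nominal 6 billion parameters for GPT-J 6B and counts weights only, ignoring activations, the KV cache, and any scaling metadata. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 6e9  # nominal parameter count assumed for a "6B" model

fp16_gb = weight_memory_gb(params, 2)  # FP16 stores 2 bytes per weight
fp8_gb = weight_memory_gb(params, 1)   # FP8 stores 1 byte per weight

print(fp16_gb, fp8_gb)
```

Halving the bytes per weight halves the weight footprint, which leaves more GPU memory for batching additional requests.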

Availability

Early access to NVIDIA TensorRT-LLM is now available, and it will soon be integrated into the NeMo framework as part of NVIDIA AI Enterprise. Additionally, optimized, ready-to-run versions of many models are already included, such as Meta Llama 2, OpenAI GPT-2 and GPT-3, Falcon, Mosaic MPT, BLOOM, and a dozen others, all with an easy-to-use Python API.

Calvin Liew

Ex-competitive rhythm gamer who is always the "Good but not the best". You'd know me as Vindy if you know where to look. Currently on a quest to own enough keyboards with different plates and just slapping MX Black on them.