NVIDIA Announces H100 NVL – Max Memory Server Card for Large Language Models
While this year’s Spring GTC event doesn’t feature any new GPUs or GPU architectures from NVIDIA, the company is still in the process of rolling out new products based on the Hopper and Ada Lovelace GPUs it launched over the past year. At the high end of the market, the company today is announcing a new H100 accelerator variant specifically aimed at large language model users: the H100 NVL.
The H100 NVL is an interesting variant on NVIDIA’s H100 PCIe card that, in a sign of the times and NVIDIA’s extensive success in the AI field, is aimed at a singular market: large language model (LLM) deployment. There are a few things that make this card atypical of NVIDIA’s usual server fare – not the least of which is that it’s 2 H100 PCIe boards that come already bridged together – but the big takeaway is the large memory capacity. The combined dual-GPU card offers 188GB of HBM3 memory – 94GB per card – providing more memory per GPU than any other NVIDIA part to date, even within the H100 family.
NVIDIA H100 Accelerator Specification Comparison

| | H100 NVL | H100 PCIe | H100 SXM |
|---|---|---|---|
| FP32 CUDA Cores | 2 x 16896? | 14592 | 16896 |
| Tensor Cores | 2 x 528? | 456 | 528 |
| Boost Clock | 1.98GHz? | 1.75GHz | 1.98GHz |
| Memory Clock | ~5.1Gbps HBM3 | 3.2Gbps HBM2e | 5.23Gbps HBM3 |
| Memory Bus Width | 6144-bit | 5120-bit | 5120-bit |
| Memory Bandwidth | 2 x 3.9TB/sec | 2TB/sec | 3.35TB/sec |
| VRAM | 2 x 94GB (188GB) | 80GB | 80GB |
| FP32 Vector | 2 x 67 TFLOPS? | 51 TFLOPS | 67 TFLOPS |
| FP64 Vector | 2 x 34 TFLOPS? | 26 TFLOPS | 34 TFLOPS |
| INT8 Tensor | 2 x 1980 TOPS | 1513 TOPS | 1980 TOPS |
| FP16 Tensor | 2 x 990 TFLOPS | 756 TFLOPS | 990 TFLOPS |
| TF32 Tensor | 2 x 495 TFLOPS | 378 TFLOPS | 495 TFLOPS |
| FP64 Tensor | 2 x 67 TFLOPS? | 51 TFLOPS | 67 TFLOPS |
| Interconnect | NVLink 4, 18 Links (900GB/sec) | NVLink 4 (600GB/sec) | NVLink 4, 18 Links (900GB/sec) |
| GPU | 2 x GH100 (814mm2) | GH100 (814mm2) | GH100 (814mm2) |
| Transistor Count | 2 x 80B | 80B | 80B |
| TDP | 700-800W | 350W | 700-800W |
| Manufacturing Process | TSMC 4N | TSMC 4N | TSMC 4N |
| Interface | 2 x PCIe 5.0 (Quad Slot) | PCIe 5.0 (Dual Slot) | SXM5 |
| Architecture | Hopper | Hopper | Hopper |
Driving this SKU is a specific niche: memory capacity. Large language models like the GPT family are in many respects memory capacity bound, as they’ll quickly fill up even an H100 accelerator in order to hold all of their parameters (175B in the case of the largest GPT-3 models). As a result, NVIDIA has opted to scrape together a new H100 SKU that offers a bit more memory per GPU than their usual H100 parts, which top out at 80GB per GPU.
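To put that capacity crunch in rough numbers, here is a quick back-of-the-envelope sketch (our own arithmetic, not NVIDIA’s figures) of how many GPUs it takes just to hold the weights of a 175B-parameter model at common inference precisions:

```python
# Back-of-the-envelope sketch (not NVIDIA's figures): how many GPUs are needed
# just to hold a model's weights at a given precision. Weights only - the
# KV cache, activations, and framework overhead all need additional memory.
PARAMS = 175e9  # GPT-3-sized model, 175B parameters

def weights_gb(params, bytes_per_param):
    return params * bytes_per_param / 1e9

for label, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
    size = weights_gb(PARAMS, bytes_per_param)
    for gpu, capacity_gb in [("80GB H100", 80), ("94GB H100 NVL GPU", 94)]:
        needed = -(-size // capacity_gb)  # ceiling division
        print(f"{label}: ~{size:.0f}GB of weights -> at least {needed:.0f}x {gpu}")
```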
Under the hood, what we’re looking at is essentially a special bin of the GH100 GPU that’s being placed on a PCIe card. All GH100 GPUs come with 6 stacks of HBM memory – either HBM2e or HBM3 – with a capacity of 16GB per stack. However, for yield reasons, NVIDIA only ships its regular H100 parts with 5 of the 6 HBM stacks enabled. So while there is nominally 96GB of VRAM on each GPU, only 80GB is available on regular SKUs.
The H100 NVL, in turn, is the mythical fully-enabled SKU with all 6 stacks turned on. By enabling the 6th HBM stack, NVIDIA is able to access the additional memory and additional memory bandwidth that it affords. It will have some material impact on yields – how much is a closely guarded NVIDIA secret – but the LLM market is apparently big enough, and willing to pay a high enough premium for nearly perfect GH100 packages, to make it worth NVIDIA’s while.
Even then, it should be noted that customers aren’t getting access to quite all 96GB per card. Rather, at a total capacity of 188GB of memory, they’re getting effectively 94GB per card. NVIDIA hasn’t gone into detail on this design quirk in our pre-briefing ahead of today’s keynote, but we suspect this is also for yield reasons, giving NVIDIA some slack to disable bad cells (or layers) within the HBM3 memory stacks. The net result is that the new SKU offers 14GB more memory per GH100 GPU, a 17.5% memory increase. Meanwhile the aggregate memory bandwidth for the card stands at 7.8TB/second, which works out to 3.9TB/second for the individual boards.
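The memory math itself is easy to sanity-check. The snippet below (again our own arithmetic, using the per-stack figures quoted above) reproduces the capacity uplift and bandwidth numbers:

```python
# Sanity-checking the H100 NVL memory figures quoted above (our own arithmetic).
stacks = 6
gb_per_stack = 16
nominal_per_gpu = stacks * gb_per_stack  # 96GB physically present per GPU
usable_per_gpu = 94                      # what NVIDIA exposes per GPU on the NVL
regular_h100 = 80                        # 5-stack H100 PCIe/SXM parts

print(f"Nominal per GPU: {nominal_per_gpu}GB, usable: {usable_per_gpu}GB")
print(f"Dual-card total: {2 * usable_per_gpu}GB")
print(f"Uplift vs. 80GB parts: {(usable_per_gpu - regular_h100) / regular_h100:.1%}")

# Bandwidth: 6144-bit bus at roughly 5.1Gbps per pin, per GPU.
bus_bits = 6144
gbps_per_pin = 5.1
per_gpu_tb_s = bus_bits * gbps_per_pin / 8 / 1000
print(f"Per-GPU bandwidth: ~{per_gpu_tb_s:.1f}TB/s, card total: ~{2 * per_gpu_tb_s:.1f}TB/s")
```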
Besides the memory capacity increase, in many ways the individual cards within the larger dual-GPU/dual-card H100 NVL look a lot like the SXM5 version of the H100 placed on a PCIe card. While the normal H100 PCIe is hamstrung somewhat by slower HBM2e memory, fewer active SMs/tensor cores, and lower clockspeeds, the tensor core performance figures NVIDIA is quoting for the H100 NVL are all at parity with the H100 SXM5, indicating that this card isn’t further cut back like the normal PCIe card. We’re still waiting on the final, complete specifications for the product, but assuming everything here is as presented, the GH100s going into the H100 NVL would represent the highest binned GH100s currently available.
And an emphasis on the plural is called for here. As noted earlier, the H100 NVL is not a single GPU part, but rather a dual-GPU/dual-card part, and it presents itself to the host system as such. The hardware itself is based on two PCIe form-factor H100s that are strapped together using three NVLink 4 bridges. Physically, this is virtually identical to NVIDIA’s existing H100 PCIe design – which can already be paired up using NVLink bridges – so the difference isn’t in the construction of the two-board/four-slot behemoth, but rather the quality of the silicon inside. Put another way, you can strap together regular H100 PCIe cards today, but it wouldn’t match the memory bandwidth, memory capacity, or tensor throughput of the H100 NVL.
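From the software side, the card should simply enumerate as two CUDA devices linked by NVLink. As a rough illustration, and assuming a host with a working CUDA-enabled PyTorch install (the device indices here are purely illustrative), the pair can be inspected like so:

```python
# Minimal sketch: an H100 NVL presents itself to the host as two separate CUDA
# devices bridged by NVLink. Assumes PyTorch with CUDA; indices are illustrative.
import torch

assert torch.cuda.is_available()
print(f"Visible CUDA devices: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}: {props.name}, {props.total_memory / 1e9:.0f}GB")

# With the boards bridged, peer-to-peer access should be available, letting
# tensors move GPU-to-GPU over NVLink rather than bouncing through host memory.
if torch.cuda.device_count() >= 2:
    print("Peer access 0<->1:", torch.cuda.can_device_access_peer(0, 1))
    x = torch.ones(1024, 1024, device="cuda:0")
    y = x.to("cuda:1")  # direct device-to-device copy when P2P is enabled
    print(y.device)
```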
Surprisingly, despite the stellar specs, TDPs remain virtually unchanged. The H100 NVL is a 700W to 800W part, which breaks down to 350W to 400W per board, the lower bound of which is the same TDP as the regular H100 PCIe. In this case NVIDIA looks to be prioritizing compatibility over peak performance, as few server chassis can handle PCIe cards over 350W (and fewer still over 400W), meaning that TDPs need to stand pat. Still, given the higher performance figures and memory bandwidth, it’s unclear how NVIDIA is affording the extra performance. Power binning can go a long way here, but it may also be a case where NVIDIA is giving the card a higher than usual boost clockspeed, since the target market is primarily concerned with tensor performance and isn’t going to be lighting up the entire GPU at once.
Otherwise, NVIDIA’s decision to release what is essentially the best H100 bin is an unusual choice given their general preference for SXM parts, but it’s a decision that makes sense in the context of what LLM customers need. Large SXM-based H100 clusters can easily scale up to 8 GPUs, but the amount of NVLink bandwidth available between any two is hamstrung by the need to go through NVSwitches. For just a two-GPU configuration, pairing a set of PCIe cards is much more direct, with the fixed link guaranteeing 600GB/second of bandwidth between the cards.
But perhaps more important than that is simply the matter of being able to quickly deploy H100 NVL in existing infrastructure. Rather than requiring the installation of H100 HGX carrier boards specifically built to pair up GPUs, LLM customers can just toss H100 NVLs into new server builds, or use them as a relatively quick upgrade to existing server builds. NVIDIA is going after a very specific market here, after all, so the normal advantage of SXM (and NVIDIA’s ability to throw its collective weight around) may not apply here.
All told, NVIDIA is touting the H100 NVL as offering 12x the GPT3-175B inference throughput of a last-generation HGX A100 (8 H100 NVLs vs. 8 A100s). For customers looking to deploy and scale up their systems for LLM workloads as quickly as possible, that is certainly going to be tempting. As noted earlier, the H100 NVL doesn’t bring anything new to the table in terms of architectural features – much of the performance boost here comes from the Hopper architecture’s new transformer engines – but the H100 NVL will serve a specific niche as the fastest PCIe H100 option, and the option with the largest GPU memory pool.
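For illustration only, one straightforward way to exploit a pooled two-GPU card like this is to shard a model’s weights across both devices. The sketch below assumes Hugging Face Transformers with Accelerate installed; the checkpoint name is a placeholder rather than anything NVIDIA has blessed:

```python
# Illustrative sketch: sharding a large model across both GPUs of a dual-GPU card.
# Assumes Hugging Face Transformers + Accelerate; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/some-70b-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets Accelerate spread layers across cuda:0 and cuda:1,
# treating the two 94GB GPUs as one ~188GB pool for the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("The H100 NVL is aimed at", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```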
Wrapping things up, according to NVIDIA, H100 NVL cards will begin shipping in the second half of this year. The company is not quoting a price, but for what is essentially a top GH100 bin, we’d expect them to fetch a top price, especially in light of how the explosion of LLM usage is turning into a new gold rush for the server GPU market.