AMD strikes back at Nvidia with new MI300X benchmarks: MI300X shows 30% higher performance than the H100, even with an optimized software stack
Neither AMD nor Nvidia intends to back out of this argument over the performance difference between the Instinct MI300X and the H100 (Hopper) GPUs. But AMD does make some strong points when comparing FP16 using vLLM, the more popular choice, against FP8, which works only with TensorRT-LLM.
The red team introduced the MI300X accelerator earlier this December, claiming up to a 1.6x lead over Nvidia's H100. Two days ago, Nvidia fired back, saying AMD didn't use its optimizations when comparing the H100 with TensorRT-LLM. That response pitted a single H100 against eight-way H100 GPUs while running the Llama 2 70B chat model.
The Continuing Battle of Benchmark Results and Test Scenarios
In this latest response, AMD said that Nvidia used a selective set of inferencing workloads. It further noted that Nvidia benchmarked these with its in-house TensorRT-LLM on the H100 rather than with vLLM, an open-source and widely used method. Moreover, Nvidia compared AMD's vLLM FP16 performance against DGX-H100 results that used TensorRT-LLM with the FP8 datatype, which AMD says misconstrues the results. AMD stressed that it used vLLM with FP16 in its testing because of vLLM's widespread adoption, and vLLM does not support FP8.
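For context, vLLM serves models in half precision (FP16) and exposes no FP8 path, which is central to AMD's argument. Below is a minimal sketch of the kind of vLLM FP16 run described here; the model ID, prompts, and sampling settings are illustrative assumptions, not AMD's actual benchmark harness.

```python
# A minimal sketch of an FP16 vLLM run, as in AMD's tests; the model ID,
# prompts, and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

# Llama 2 70B chat in half precision (FP16), split across 8 GPUs via
# tensor parallelism. vLLM offers no FP8 option, which is AMD's point.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    dtype="float16",
    tensor_parallel_size=8,
)

prompts = ["Explain GPU tensor parallelism in one paragraph."] * 32
params = SamplingParams(temperature=0.8, max_tokens=256)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])
```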
There is also the point that servers incur latency, but instead of accounting for that, Nvidia showed only throughput performance, which does not emulate a real-world scenario, according to AMD.
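The distinction AMD is drawing can be made concrete: batched offline generation maximizes aggregate tokens per second, while the time a single user waits for a response is a separate number. A rough sketch, reusing the hypothetical `llm` object from the snippet above:

```python
import time

def measure(llm, params, prompts):
    # Batched run: maximizes throughput (total tokens per second), the
    # metric AMD says Nvidia showed.
    t0 = time.perf_counter()
    outputs = llm.generate(prompts, params)
    batch_time = time.perf_counter() - t0
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batched throughput: {total_tokens / batch_time:.1f} tok/s")

    # Single-request run: approximates the end-to-end latency one user
    # would actually experience, the metric AMD says was left out.
    t0 = time.perf_counter()
    llm.generate(prompts[:1], params)
    print(f"single-request latency: {time.perf_counter() - t0:.2f} s")
```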
AMD's Updated Test Results With More Optimizations, Accounting for Latency, and Using Nvidia's Testing Methodology
AMD made three performance runs using Nvidia's TensorRT-LLM. The first compared the two GPUs using vLLM on both, hence FP16. In the second, AMD pitted its MI300X running vLLM against the H100 running TensorRT-LLM. The third and most notable run measured latency between the MI300X using vLLM with FP16 and the H100 with TensorRT-LLM.
So, for its second and third scenarios, AMD used the same selective testing conditions Nvidia did, and still showed higher performance and reduced latency. The company also added further optimizations for the comparison against the H100 with vLLM running on both, yielding a 2.1x boost in performance.
It's now up to Nvidia to decide how it wants to respond. But it also needs to acknowledge that its approach would require the industry to ditch FP16 for FP8 within TensorRT-LLM's closed system, essentially abandoning vLLM for good. Referring to Nvidia's premium, a Redditor once said, "TensorRT-LLM is free, just like the things that come free with a Rolls Royce."