AMD MI300 Performance – Faster Than H100, But How Much?
Today the MI300X is finally launched, and it's coming out with a bang. There were a lot of customers announced, whose volumes and ASPs we discussed here, including the likes of Oracle, Meta, and Microsoft. We posted the configuration and architecture back in June, so while there are new low-level architecture details at the end of this piece, today we'll mostly focus on performance, cost, and software. Also, big news on the AMD + Broadcom anti-Nvidia alliance.
On raw specs, the MI300X dominates the H100 with 30% more FP8 FLOPS, 60% more memory bandwidth, and more than 2x the memory capacity. Of course, the MI300X really sells against the H200, which narrows the gap on memory bandwidth to roughly the single-digit percent range and the capacity advantage to less than 40%. Unfortunately, the MI300X was only able to hit 5.3TB/s of memory bandwidth instead of the 5.6TB/s originally targeted.
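As a quick sanity check on those headline deltas, the sketch below recomputes the ratios from vendor-quoted spec sheet figures (dense FP8 for both parts); these are public specs we are assuming, not measurements of ours.

```python
# Vendor-quoted spec sheet figures (dense FP8); assumptions, not measurements.
mi300x = {"fp8_tflops": 2615, "bw_tbs": 5.3, "hbm_gb": 192}
h100   = {"fp8_tflops": 1979, "bw_tbs": 3.35, "hbm_gb": 80}

# Ratios behind the "30% more FLOPS, 60% more bandwidth, >2x capacity" claims.
for key in mi300x:
    print(f"{key}: {mi300x[key] / h100[key]:.2f}x")
# fp8_tflops: 1.32x, bw_tbs: 1.58x, hbm_gb: 2.40x
```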
Of course, FLOPS, capacity, and bandwidth are only potential capabilities. AMD showed a number of different benchmarks, and the main theme is that they're still quite a bit under peak performance versus theoretical.
- FlashAttention2 – This is forward pass only, i.e. inference, not training. It's noteworthy that almost every benchmark AMD shared was forward pass only. The performance advantage is 10% to 20%, far short of the raw specs.
- LLaMA2-70B – Again forward pass only, and only for certain kernels, not the full model, and again 10% to 20% better performance. These are more compute-bound workloads, not memory-bound.
On inference, AMD showed two different benchmarks: one was high batch size and throughput, the other was the lowest latency possible.
- Bloom – This benchmark is the most impressive of them all, but we believe it is one of the classic tricks we've seen other firms pull when they have a memory capacity advantage. Use a model that barely fits in the inference system; in this case, Bloom takes a bit over 350GB of the 640GB of memory that the H100 HGX has. Then use a very large input sequence length (2k in this case) relative to the output token count (100). The system with the smaller memory capacity is forced to run with a much smaller batch size because the KV cache takes up all the remaining memory, while AMD can use a larger batch size to leverage their compute (see the sketch after this list). To be clear, this is a real advantage and the throughput-focused scenario is real, but it's an edge case.
- LLaMA 2-70B – This is a more realistic inference benchmark for most use cases. AMD shows a 40% latency advantage, which is very plausible given their 60% bandwidth advantage vs the H100. Given the H200 comes a lot closer in bandwidth, we expect it to perform similarly. Note AMD used vLLM for Nvidia, which is the best open stack for throughput, but Nvidia's closed-source TensorRT-LLM is just as easy to use and has significantly better latency on the H100.
The last benchmark is LLaMA 2-13B. The performance improvement is 20% here, and there isn't much to caveat. The MI300X is cheaper, and the H200 likely closes the gap.
On to training. AMD shows a bit of weakness from their software stack here: they achieve less than 30% of the theoretical FLOPS that the MI300 is capable of, while Nvidia consistently achieves 40%. As such, performance is lacking relative to potential.

Their performance only matches Nvidia's for a few reasons. One of the chief reasons is that AMD only gets about half the theoretical FLOPS in raw GEMM workloads. The other is that FlashAttention2 still doesn't work well on the backward pass. It's coming, but there are architectural differences that make it tough: AMD's L1 cache is doubled, but the LDS is still the same size, which makes FA2 harder to implement than on Nvidia's larger shared memory.
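The arithmetic behind that tie is simple, as the sketch below shows. The peak dense BF16 figures are the vendor-quoted specs, and the utilization fractions are the rough ones cited above, not our own measurements.

```python
# Why ~30% utilization on MI300 only ties ~40% on H100: the bigger peak
# and the lower achieved fraction roughly cancel out.
MI300X_PEAK_TFLOPS = 1307   # dense BF16, vendor-quoted
H100_PEAK_TFLOPS = 989      # dense BF16, vendor-quoted

print(0.30 * MI300X_PEAK_TFLOPS)  # ~392 achieved TFLOPS on MI300X
print(0.40 * H100_PEAK_TFLOPS)    # ~396 achieved TFLOPS on H100
```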
Over time, we expect this to improve meaningfully. That's the big bright spot in these numbers: we see AMD improving rapidly.
In general, we're watching Triton performance get better, especially for raw GEMM.
"OpenAI is working with AMD in support of an open ecosystem. We plan to support AMD's GPUs including MI300 in the standard Triton distribution starting with the upcoming 3.0 release."

– Philippe Tillet, OpenAI
This is a big deal, as OpenAI and Microsoft will be using AMD MI300 heavily for inference.
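For readers who haven't seen Triton code, below is a minimal tiled-GEMM kernel in the style of Triton's own tutorials; it's a sketch of the kind of raw GEMM workload discussed above, not AMD's or OpenAI's actual kernels, and the block sizes would need per-hardware tuning.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Masked loads handle edge tiles; the compiler maps tl.dot onto the
        # hardware's matrix units for whichever backend it targets.
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

The point is that the same source compiles to whichever backend Triton targets, which is exactly why first-class MI300 support in the 3.0 release matters.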
Also, to be clear, eager mode and torch.compile just work out of the box for most existing models across training, fine-tuning, and inference; what's lacking is performance optimization. We see that happening.
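As an illustration of how little user-side code changes, here is a minimal torch.compile sketch with a toy model of our own; on ROCm builds of PyTorch, the MI300 is exposed through the same "cuda" device string.

```python
import torch
import torch.nn as nn

# Toy model purely to illustrate the API surface.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
device = "cuda" if torch.cuda.is_available() else "cpu"  # "cuda" covers ROCm builds too
model = model.to(device)

# torch.compile lowers the graph through TorchInductor, which emits Triton
# kernels -- the path AMD benefits from as the ROCm Triton backend matures.
compiled = torch.compile(model)

x = torch.randn(8, 4096, device=device)
y = compiled(x)  # matches eager numerically; how fast it runs is the open question
```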
Over the next handful of months, we'd bet AMD's performance keeps rising versus the H100. While the H200 is a reset, the MI300 should still win overall with more software optimization.
The more important factor is OEMs and clouds. Microsoft, of course, is supporting it. Oracle will also be supporting it, as we noted in the past, and they also announced customers such as Databricks (MosaicML).
But they aren't the only ones.