Competitive performance claims and industry leadin…


2023-12-16 19:08:01

On December 6th, we launched the AMD Instinct MI300X and MI300A accelerators, along with the ROCm 6 software stack, at our Advancing AI event.

Since then, Nvidia published a set of benchmarks comparing the performance of H100 to the AMD Instinct MI300X accelerator on a select set of inference workloads.

The new benchmarks:

  1. Used TensorRT-LLM on H100 instead of the vLLM used in AMD's benchmarks
  2. Compared performance of the FP16 datatype on AMD Instinct MI300X GPUs to the FP8 datatype on H100
  3. Inverted the AMD published performance data from relative latency numbers to absolute throughput
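As a rough illustration of the inversion in point 3: a relative-latency chart only fixes a ratio between two systems, so converting it to absolute throughput additionally requires choosing an absolute latency for one side. The numbers below are hypothetical placeholders, not measured data:

```python
# Sketch of converting end-to-end generation latency into
# output-token throughput. All figures are HYPOTHETICAL.

def throughput_tokens_per_s(latency_s: float, output_tokens: int) -> float:
    """Output-token throughput implied by an end-to-end latency."""
    return output_tokens / latency_s

baseline_latency_s = 10.0   # hypothetical absolute latency for one system
speedup = 1.4               # a relative advantage read off a latency chart
faster_latency_s = baseline_latency_s / speedup

output_tokens = 128         # matches the benchmark's output length
print(throughput_tokens_per_s(baseline_latency_s, output_tokens))  # 12.8
print(throughput_tokens_per_s(faster_latency_s, output_tokens))
```

Any error in the assumed absolute latency scales both throughput figures, which is why the inversion step matters when comparing published results.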

We are at a stage in our product ramp where we are consistently identifying new paths to unlock performance with our ROCm software and AMD Instinct MI300 accelerators. The data presented at our launch event was recorded in November. We have made a lot of progress since recording that data and are delighted to share our latest results highlighting these gains.

The following chart uses the latest MI300X performance data running Llama-70B to compare:

Figure 1: Llama-70B inference latency performance (median) using batch size 1, 2048 input tokens and 128 output tokens. See end notes.

  1. MI300X vs H100 using vLLM for both.
    1. At our launch event in early December, we highlighted a 1.4x performance advantage for MI300X vs H100 using an equivalent datatype and library setup. With the latest optimizations we have made, this performance advantage has increased to 2.1x.
    2. We selected vLLM based on its broad adoption by the user and developer community; it supports both AMD and Nvidia GPUs.
  2. MI300X using vLLM vs H100 using Nvidia's optimized TensorRT-LLM.
    1. Even when using TensorRT-LLM for H100, as our competitor outlined, and vLLM for MI300X, we still show a 1.3x improvement in latency.
  3. Measured latency results for MI300X using FP16 vs H100 using TensorRT-LLM and FP8.
    1. MI300X continues to demonstrate a performance advantage when measuring absolute latency, even when using the lower-precision FP8 and TensorRT-LLM for H100 vs. vLLM and the higher-precision FP16 datatype for MI300X.
    2. We use the FP16 datatype due to its popularity, and today, vLLM does not support FP8.
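The median-latency methodology behind these comparisons can be sketched as a simple timing harness. Here `generate` is a stub standing in for a real inference call (for example vLLM's `LLM.generate`), so the timings it produces are meaningless; only the harness structure is illustrative:

```python
# Minimal sketch of a median end-to-end latency measurement.
import statistics
import time

def generate(prompt_tokens: int, output_tokens: int) -> None:
    """Placeholder for a real model call; sleeps briefly to simulate work."""
    time.sleep(0.001)

def median_latency_s(runs: int = 5, prompt_tokens: int = 2048,
                     output_tokens: int = 128) -> float:
    """Time several end-to-end generations and report the median."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt_tokens, output_tokens)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

print(f"median latency: {median_latency_s():.4f} s")
```

Reporting the median rather than the mean reduces the influence of occasional slow outlier runs.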

These results again show that MI300X using FP16 is comparable to H100 with its best performance settings recommended by Nvidia, even when H100 uses FP8 and TensorRT-LLM.
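One reason the FP16-vs-FP8 comparison is a conservative setting for MI300X: FP8 halves the bytes stored per weight, roughly halving the memory footprint and bandwidth needed per generated token. A back-of-the-envelope weight-memory calculation for a nominal 70B-parameter model:

```python
# Rough weight-memory comparison for the two datatypes discussed above.
PARAMS = 70e9     # nominal parameter count of a 70B model
BYTES_FP16 = 2    # 16-bit floating point
BYTES_FP8 = 1     # 8-bit floating point

fp16_gb = PARAMS * BYTES_FP16 / 1e9   # weight memory at FP16
fp8_gb = PARAMS * BYTES_FP8 / 1e9     # weight memory at FP8

print(fp16_gb, fp8_gb)  # 140.0 70.0
```

Since batch-size-1 decoding is largely memory-bandwidth bound, the lower-precision side gets an inherent latency advantage from moving half the bytes.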

We look forward to sharing more performance data, including new datatypes, additional throughput-specific benchmarks beyond the Bloom 176B data we presented at launch, and more performance tuning as we continue working with our customer and ecosystem partners to broadly deploy Instinct MI300 accelerators.

###

End notes:

Overall latency for text generation using the Llama2-70b chat model with vLLM comparison, using a custom docker container for each system, based on AMD internal testing as of 12/14/2023. Sequence length of 2048 input tokens and 128 output tokens.


Configurations:

  1. 2P Intel Xeon Platinum 8480C CPU server with 8x AMD Instinct™ MI300X (192GB, 750W) GPUs, ROCm® 6.0 pre-release, PyTorch 2.2.0 pre-release, vLLM for ROCm, using FP16, Ubuntu® 22.04.3, vs. an Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL processors, 8x Nvidia H100 (80GB, 700W) GPUs, CUDA 12.1, PyTorch 2.1.0, vLLM v.02.2.2 (most recent), using FP16, Ubuntu 22.04.3.
  2. 2P Intel Xeon Platinum 8480C CPU server with 8x AMD Instinct™ MI300X (192GB, 750W) GPUs, ROCm® 6.0 pre-release, PyTorch 2.2.0 pre-release, vLLM for ROCm, using FP16, Ubuntu® 22.04.3, vs. an Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL processors, 8x Nvidia H100 (80GB, 700W) GPUs, CUDA 12.2.2, PyTorch 2.1.0, TensorRT-LLM v.0.6.1, using FP16, Ubuntu 22.04.3.
  3. 2P Intel Xeon Platinum 8480C CPU server with 8x AMD Instinct™ MI300X (192GB, 750W) GPUs, ROCm® 6.0 pre-release, PyTorch 2.2.0 pre-release, vLLM for ROCm, using FP16, Ubuntu® 22.04.3, vs. an Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL processors, 8x Nvidia H100 (80GB, 700W) GPUs, CUDA 12.2.2, PyTorch 2.2.2, TensorRT-LLM v.0.6.1, using FP8, Ubuntu 22.04.3.

Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of the latest drivers and optimizations.

 

 
