Building The Great Memory Bandwidth Beast


2023-01-25 13:33:42

If memory bandwidth is holding back the performance of some of your applications, there is something you can do about it other than just suffer. You can tune the CPU core to memory bandwidth ratios by choosing your chips wisely, and you can lean on chip makers and system builders to push it even further.

It is interesting to contemplate what HPC and AI computing might look like if CPUs were not so limited on memory bandwidth and, in some cases, memory capacity. Or, to speak more precisely, if memory were not so expensive relative to compute. We can, perhaps, do something about the former, and we will turn blue in the face and perhaps die waiting for something to happen about the latter, as we talked about briefly last week.

Sometimes, all you can do is apply a tourniquet and try to keep moving even if you cannot immediately and permanently address the problem at hand. Or foot, or wherever the wound is. Which got us thinking about how server buyers today, with some modest tweaks from server CPU and system makers, might at least get the memory bandwidth per core more in balance.

It has been getting worse and worse every year for decades, as industry luminary Jack Dongarra, last year's Turing Award winner, aptly pointed out in his keynote address.

We have been thinking about this for a while, and the preview of the Power10 processor by IBM way back in August 2019 and an anticipated (but never delivered) high-bandwidth Power9' – that's Power9 "prime," not a typo – system that Big Blue talked to us about in October 2019 whetted our appetite for systems with high memory bandwidth. (Let's call it the Power E955 so it has a name, even though it was never launched.) IBM was showing off its OpenCAPI Memory Interface (OMI) and the resulting memory it has delivered with the Power10 gear, but this chart encapsulated what IBM believed it could do in a Power chip socket with various technologies:

IBM's OMI differential DDR memory, which uses a serial interface and a SerDes that is essentially the same as the "Bluelink" signaling used for NUMA, NVLink, and OpenCAPI ports on the processor, is very different from the normal parallel DDR4 interfaces. The actual DDR protocols, be they DDR4 or DDR5, sit out on the buffer chip on the memory card, and the interface from the memory card to the CPU is a more generic OMI protocol.

This OMI memory under development back in 2019 delivered around 320 GB/sec per socket and from 256 GB to 4 TB of capacity per socket. And with a bandwidth-optimized version that cuts the memory module count by a factor of four and delivers somewhere between 128 GB and 512 GB of DDR4 capacity per socket, IBM could push the memory bandwidth on the Power9' chip to 650 GB/sec, and with the Power10 servers expected in 2021 it could push that up to 800 GB/sec using DDR5 memory that clocks a bit faster.

At the same time, for a Power9' system expected to be delivered in 2020, IBM reckoned that if it used HBM2 stacked memory, it could put on 16 GB to 32 GB of capacity and deliver somewhere around 1 TB/sec of bandwidth per socket. That is a lot of memory bandwidth per socket, but it is not very much memory capacity.

For whatever reasons – and we think whatever they were, they were not good ones, but it likely had something to do with Big Blue's technical and legal difficulties with then-foundry partner Globalfoundries – the Power9' system, very likely a four-socket machine with dual-chip modules in each socket, never saw the light of day.

But the idea of a "bandwidth beast" was reprised as the Power E1050 as part of the Power10 midrange system lineup back in July 2022.

When the "Cirrus" Power10 processor specs were divulged in August 2020, IBM said that the chip had 256 GB/sec of peak memory bandwidth per core and 120 GB/sec of sustained memory bandwidth per core. There are 16 cores on the Power10 die, but to get better yields on the 7 nanometer process from Samsung, IBM's new foundry partner, only a maximum of 15 cores are active. On the entry and midrange Power10 machines that came out last July, 4, 8, 10, and 12 cores are available in the SKU stack, and the 15-core variant is only available in the high-end "Denali" Power E1080 system that scales to 16 sockets. It is not clear if those peak and sustained memory bandwidth figures were for DDR5 memory, but we suspect so. IBM did ship the Power E1050 (and other Power10 machines) using OMI memory based on DDR4 memory, and said in its presentations that the memory streaming performance of the Power10 equipped with DDR5 memory would be 2X that of DDR4 memory.

The comparisons above are for single-chip Power10 modules. Double them up for dual-chip modules, and then adjust for the downshifted clock speeds that are required to stay in the same thermal envelope as the single-chip modules.

With the Power E1050 machine, the server has up to four Power10 DCMs and a total of 96 cores. Those eight chiplets have a total of eight OMI memory controllers and support up to 64 differential DIMMs with DDR4 memory running at 3.2 GHz, delivering 1.6 TB/sec of aggregate bandwidth across the cores. That is 17 GB/sec of memory bandwidth per Power10 core at the peak of 96 cores in the system. But let's do two things.

First, let's cut the core counts way back. That fat configuration of the Power E1050 uses 12-core Power10 chips, but there is a 48-core variant that uses only six-core chips. (Yes, that is only a 37.5 percent yield on the Power10 cores.) That doubles the bandwidth per core, up to 34 GB/sec. And if you shift to DDR5 memory running at 6.4 GHz, which is expensive and probably not broadly available at a sensible price, then you can get the memory bandwidth up to 68 GB/sec per core.

Now, in theory, if CXL memory extenders were available, you could push this same Power E1050 even further. You could burn 48 of the 56 lanes of PCI-Express 5.0 bandwidth per socket on CXL memory; adding six x8 CXL memory extenders at 32 GB/sec each yields another 192 GB/sec of memory bandwidth (with some additional latency on it, of course). That gets you to 1.8 TB/sec of aggregate bandwidth and 38 GB/sec of bandwidth per core. If IBM made the core counts smaller on each Power10 chiplet, then the memory bandwidth per core could be dialed up. At four cores per chip and 32 cores per system, you are up to 57.1 GB/sec of memory bandwidth per core. Moving to DDR5 memory plus CXL memory puts you at 84 GB/sec per core.
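The Power E1050 arithmetic above is easy to check. Here is a minimal sketch using the figures quoted in this article – the 1.6 TB/sec DDR4 aggregate and the six hypothetical x8 CXL extenders are as stated above, not measured values:

```python
# Back-of-the-envelope bandwidth-per-core math for the Power E1050 scenarios.
DDR4_AGG_GBS = 1600   # 64 differential DDR4-3200 DIMMs, aggregate GB/sec
CXL_GBS = 6 * 32      # six hypothetical x8 PCI-Express 5.0 CXL extenders

def per_core(aggregate_gbs, cores):
    """Memory bandwidth per core, in GB/sec."""
    return aggregate_gbs / cores

full = per_core(DDR4_AGG_GBS, 96)                # ~16.7 GB/sec at 96 cores
halved = per_core(DDR4_AGG_GBS, 48)              # ~33.3 GB/sec at 48 cores
ddr5 = per_core(2 * DDR4_AGG_GBS, 48)            # DDR5-6400 roughly doubles it
with_cxl = per_core(DDR4_AGG_GBS + CXL_GBS, 48)  # ~37.3 GB/sec with CXL added
print(full, halved, ddr5, with_cxl)
```

The halved and doubled results land within rounding of the 34 GB/sec and 68 GB/sec figures quoted above.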

Enter The Hybrid Compute Engines

No one is saying this is cheap, mind you. But for certain workloads, it might be a better answer than porting code to a GPU or waiting for the hybrid CPU-GPU compute engines – Instinct MI300A from AMD, "Grace-Hopper" from Nvidia, "Falcon Shores" from Intel – to come to market. And while those will have high memory bandwidth per core, the memory capacity will be constrained and will therefore be far more limited than what IBM can do with Power10 and Intel can do with the "Sapphire Rapids" Max Series CPU with hybrid HBM2e/DDR5 memory.

The Nvidia Grace chip has 72 cores and sixteen banks of LPDDR5 memory with an aggregate of 512 GB of capacity and 546 GB/sec of memory bandwidth per socket. That works out to 7.6 GB/sec of memory bandwidth per core. The Hopper GPU has 132 streaming multiprocessors – the analog to a core on a CPU – and a maximum of 3,000 GB/sec of bandwidth on its HBM3 stacked memory. (There are five stacks yielding 80 GB on the H100 accelerator.) That works out to 22.7 GB/sec of bandwidth per GPU "core," just to give you a frame of reference. If you treat all of the LPDDR5 memory on Grace as a kind of CXL-like memory, you boost the memory capacity of the CPU-GPU complex to a total of 592 GB and you boost the aggregate memory bandwidth to 3,546 GB/sec. Allocate the cores and SMs across that complex as you will. You could think of the GPU as a very expensive fast memory accelerator for the CPU cores, and that works out to 49.3 GB/sec of memory bandwidth per Grace core or 26.9 GB/sec per Hopper SM.
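The same per-core division applied to the Grace-Hopper complex, using the device figures quoted above:

```python
# Per-core and per-SM bandwidth for the Grace-Hopper complex (article figures).
GRACE_CORES, GRACE_GBS, GRACE_GB = 72, 546, 512
HOPPER_SMS, HOPPER_GBS, HOPPER_GB = 132, 3000, 80

per_grace_core = GRACE_GBS / GRACE_CORES  # ~7.6 GB/sec per Grace core
per_hopper_sm = HOPPER_GBS / HOPPER_SMS   # ~22.7 GB/sec per Hopper SM

# Treating Grace's LPDDR5 as CXL-like far memory for the whole complex:
total_gb = GRACE_GB + HOPPER_GB           # 592 GB of capacity
total_gbs = GRACE_GBS + HOPPER_GBS        # 3,546 GB/sec aggregate
print(total_gbs / GRACE_CORES, total_gbs / HOPPER_SMS)
```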

The Power10 system mentioned above is in this ballpark without much in the way of engineering.

With the AMD Instinct MI300A, we know it has 128 GB of HBM3 stacked memory across eight banks, shared across six GPUs and two 12-core Epyc 9004 CPU chiplets, but we do not know the bandwidth and we do not know the number of SMs in the collection of six GPU chiplets on the MI300A package. We can make an educated guess at the bandwidth. HBM3 runs signaling at 6.4 Gb/sec per pin and up to 16 channels. Depending on the number of DRAM chips stacked up (from four to sixteen) and their capacities (from 4 GB to 64 GB per stack), you can get various capacities and bandwidths. Using 16 Gb DRAM, the initial HBM3 stacks are expected to deliver 819 GB/sec of bandwidth per stack. It looks like AMD might be using eight stacks of 16 Gb chips that are eight chips high per stack, which would give 128 GB of capacity and would yield 6,552 GB/sec of total bandwidth at the speeds expected when the HBM3 spec was announced last April. We think that the Epyc 9004 dies on the MI300A package have 16 cores, but only 12 of them are exposed to increase yields and probably clock speeds, and that would work out to an incredible 273 GB/sec per core of memory bandwidth across those Epyc cores as they reach into the HBM3 memory. It is hard to say how many SMs are on those six GPU chiplets, but it will probably be a very high bandwidth per SM compared to prior AMD and Nvidia GPU accelerators. But, again, 128 GB of total memory per compute engine is not a lot of capacity.
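The MI300A guesswork above can be sketched in a few lines. The stack count, stack height, and per-stack rate are the educated guesses from this article, not disclosed AMD figures:

```python
# Estimated MI300A HBM3 bandwidth (all inputs are the article's assumptions).
STACKS = 8
GBS_PER_STACK = 819   # initial HBM3 spec rate with 16 Gb DRAM
GB_PER_STACK = 16     # eight 16 Gb chips high per stack

total_gbs = STACKS * GBS_PER_STACK  # 6,552 GB/sec of aggregate bandwidth
total_gb = STACKS * GB_PER_STACK    # 128 GB of capacity
epyc_cores = 2 * 12                 # two Epyc 9004 chiplets, 12 cores exposed each
per_core = total_gbs / epyc_cores   # 273 GB/sec per exposed CPU core
print(total_gbs, total_gb, per_core)
```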

And, to curb our enthusiasm a bit: for thermal reasons, AMD might have to cut back on the DRAM stacks and/or the HBM3 memory speeds and therefore might not hit the bandwidth numbers we are projecting. Even at half the bandwidth per CPU core, this would be impressive. And again, for CPU-only applications, that GPU is a very expensive add-on.

Any CXL memory that might hang off this processor to add extra capacity would help on this front, but would not add very much to the bandwidth per core or SM.

We do not know enough about future Intel Falcon Shores CPU-GPU hybrids to do any math at all.

HBM On The CPU And NUMA To The Rescue?

That brings us to Intel's Sapphire Rapids with HBM2e memory, which also has a mode that supports both HBM2e and DDR5 memory at the same time. Sapphire Rapids is interesting to us not only because it supports HBM2e stacked memory in some versions, but because it also has eight-way NUMA scalability in other versions.

We think a case could be made for the creation of an eight-way, HBM-capable system that is goosed with both DDR5 and CXL main memory. Let's go through it, starting with the regular Sapphire Rapids Xeon SP CPU.


As best we can figure, the eight DDR5 memory channels on the Sapphire Rapids Xeon SP can deliver just a tad bit more than 307 GB/sec of memory bandwidth per socket. With one DIMM per channel running at 4.8 GHz, that is 2 TB of maximum capacity. With two DIMMs per channel, you can double the capacity per socket up to 4 TB, but you run at a slower 4.4 GHz, and that yields only 282 GB/sec of memory bandwidth per socket. (This latter scenario is for a memory capacity beast, not a memory bandwidth beast.) With one DIMM per channel on the top-bin Xeon SP-8490H with 60 cores running at 1.9 GHz, that works out to a pretty skinny 5.1 GB/sec of bandwidth per core. If you drop down to the Xeon SP-8444H processor, which has only 16 cores – but they run at a higher 2.9 GHz, so you get back some of the performance you lose by dropping cores – that works out to 19.2 GB/sec of bandwidth per core.

OK, if you wanted to push the memory bandwidth per core on a socket, you could switch to the Xeon SP-6434, which has eight cores running at 3.7 GHz. That would double the bandwidth per core to 38.4 GB/sec at the 4.8 GHz DDR5 speed. There is one fewer UltraPath Interconnect (UPI) link active on this processor, so the coupling in a two-socket server would be a little less efficient and would have lower bandwidth, too. That is in the same ballpark as a six-core Power10 chip using DDR4 memory running at 3.2 GHz, and similar to what each core on the Grace Arm server CPU will see from its local LPDDR5 memory.
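The socket and per-core figures in the last two paragraphs follow directly from the DDR5 channel math: eight channels, each 8 bytes wide per transfer. A quick check, assuming those parameters:

```python
# Where the 307 GB/sec and 282 GB/sec Sapphire Rapids socket figures come from.
def ddr5_socket_gbs(mt_per_sec, channels=8, bytes_per_xfer=8):
    """Peak DDR5 socket bandwidth in GB/sec for a given transfer rate."""
    return mt_per_sec * channels * bytes_per_xfer / 1000

one_dpc = ddr5_socket_gbs(4800)  # 307.2 GB/sec at one DIMM per channel
two_dpc = ddr5_socket_gbs(4400)  # 281.6 GB/sec at two DIMMs per channel

# Per-core bandwidth for the 60-, 16-, and 8-core SKUs discussed above:
print(one_dpc / 60, one_dpc / 16, one_dpc / 8)
```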

Now, let's talk about the Sapphire Rapids HBM variant. The top-bin Max Series CPU has 56 cores, and the four HBM2e stacks have 64 GB of capacity and 1,230 GB/sec of aggregate bandwidth. That works out to 22 GB/sec of memory bandwidth per core. The low-bin part has 32 cores across that same 1,230 GB/sec of memory, or 38 GB/sec per core. If you add the DDR5 memory on the socket, you can add another 307 GB/sec, and if you add the CXL memory extenders, you get another 192 GB/sec. So now you are up to an aggregate of 1,729 GB/sec of memory bandwidth across 32 cores, or 54 GB/sec per core.

Now, let's push this to an extreme by leveraging the NUMA interconnect to link together eight Sapphire Rapids HBM sockets – which Intel is not allowing – and dropping the core counts to eight cores per socket running at 4 GHz. That yields 64 cores running at 4 GHz, or a little more oomph than a single 60-core Sapphire Rapids Xeon SP-8490H. But now, with HBM, DDR5, and CXL memory all added in, this eight-socket box has an aggregate of 13,912 GB/sec of memory bandwidth, and a total of 217.4 GB/sec per core.
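Totaling up this hypothetical eight-socket box with the per-socket figures used above (the eight-way HBM configuration is a thought experiment, not a shipping Intel product) lands within rounding of the quoted aggregate:

```python
# Aggregate bandwidth for the hypothetical eight-socket Sapphire Rapids HBM box.
HBM2E_GBS = 1230  # four HBM2e stacks per socket
DDR5_GBS = 307    # eight DDR5-4800 channels per socket
CXL_GBS = 192     # six hypothetical x8 CXL extenders per socket
SOCKETS, CORES_PER_SOCKET = 8, 8

per_socket = HBM2E_GBS + DDR5_GBS + CXL_GBS       # 1,729 GB/sec per socket
total = per_socket * SOCKETS                      # ~13.8 TB/sec aggregate
per_core = total / (SOCKETS * CORES_PER_SOCKET)   # ~216 GB/sec per core
print(total, per_core)
```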

This would not be a cheap box, we are sure. But then again, neither is the Power E1050.

And if IBM dialed the cores down on the Power E1080 and added in CXL extenders, it could get something across 16 sockets that would be an aggregate of 6,544 GB/sec across the OMI memory attached to those 16 sockets, plus another 3,072 GB/sec across six CXL memory modules per socket on the PCI-Express 5.0 bus, for a total of 9,616 GB/sec. How few cores do you want here? At four cores per Power10 SCM, that is 64 cores, which works out to 150 GB/sec of main memory bandwidth per core.
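The Power E1080 variant of the same exercise, again using the figures quoted above (the six-extenders-per-socket CXL configuration is hypothetical):

```python
# Hypothetical de-cored Power E1080 with CXL extenders (article figures).
OMI_TOTAL_GBS = 6544         # OMI memory bandwidth across 16 sockets
CXL_PER_SOCKET_GBS = 6 * 32  # six hypothetical x8 CXL extenders per socket
SOCKETS, CORES_PER_SOCKET = 16, 4

cxl_total = CXL_PER_SOCKET_GBS * SOCKETS               # 3,072 GB/sec
grand_total = OMI_TOTAL_GBS + cxl_total                # 9,616 GB/sec
per_core = grand_total / (SOCKETS * CORES_PER_SOCKET)  # ~150 GB/sec per core
print(grand_total, per_core)
```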

The point is, there is a way to build server nodes that focus on higher memory bandwidth per core and are thus suitable for accelerating certain kinds of HPC and analytics workloads, and perhaps even a portion of AI training workloads. You could end up more compute bound than memory capacity or memory bandwidth bound, and you have to be very careful not to leave that terribly expensive memory untaxed by not having enough cores pulling data out of it and pushing data into it.

By the way, we are less sure about how this bandwidth beast approach might accelerate AI training – perhaps only on pretrained models that are being pruned and tuned. We have a hunch that even GPUs have an imbalance between GPU core flops and attached HBM2e and HBM3 stacked memory bandwidth, and that they run nowhere near peak computational efficiency because of this.

We fully realize that none of this would be cheap. But neither are GPU-accelerated machines. And having a better balance of compute, memory bandwidth, and memory capacity might be a better answer for certain workloads than chopping memory into pieces and spreading datasets across dozens of CPUs. Admittedly, you really have to want to accelerate these workloads differently – and to program them across a memory hierarchy – to push the limits.

That is what thought experiments are for.
