Zen 4 Gets VCache – Chips and Cheese

Compute performance has been held back by memory performance for decades, with DRAM performance falling further behind with every passing year. Caches compensate by trying to keep frequently used data closer to the CPU core. As DRAM performance became increasingly inadequate, caching setups grew more sophisticated to cope. The late 2000s saw CPUs transition to a triple level cache hierarchy, in order to get several megabytes of on-chip cache.
| CPU | Data Caching |
|---|---|
| 1995: Intel Pentium Pro at 200 MHz | L1: 16 KB data, 3 cycle latency. L2: 256 KB, chiplet, ~7 cycle latency. DRAM: EDO, ~430 ns, ~86 cycles |
| 2006: AMD Athlon 64 FX-62 at 2.8 GHz | L1: 64 KB data, 3 cycle latency. L2: 1 MB on-die, 12 cycle latency. DRAM: DDR2-800, 60 ns, 168 cycles |
| 2010: Intel Xeon X5650 at 3.067 GHz | L1: 32 KB data, 4 cycle latency. L2: 256 KB, 10 cycle latency. L3: 12 MB on-die, 46 cycle latency. DRAM: DDR3-1333, 70.3 ns, 215 cycles |
| 2022: AMD Ryzen 7 5800X3D at 4.5 GHz | L1: 32 KB data, 4 cycle latency. L2: 512 KB, 12 cycle latency. L3: 96 MB, stacked chiplets, 49 cycle latency. DRAM: DDR4-3200, 78.72 ns, 354 cycles |
AMD launched the Ryzen 7 5800X3D roughly a year ago, with a 64 MB cache die stacked on top of the regular core die. This wasn't the first attempt to dramatically increase caching capacity. But VCache stands out because it combined high capacity caching with very high performance. On the consumer side, the 5800X3D performed very well in gaming workloads, while VCache-equipped Milan chips targeted HPC applications.
This year, AMD is trying to repeat Zen 3's "VCache" success in their new Zen 4 lineup. Zen 4 benefits from both an improved core architecture and a newer 5 nm process node. For the L3 cache, 5 nm means a dramatic area reduction. The column of TSVs (through-silicon vias) used to connect the stacked cache also saw a nice area reduction, even though IO often sees less gain from process node shrinks.
AMD has also expanded VCache's scope within their consumer lineup. Instead of just one VCache SKU limited to eight cores, Zen 4 gets VCache in higher core count chips as well. However, there's a twist. SKUs with more than eight cores only have VCache implemented on one of the core chiplets, meaning that only half of the cores benefit from the increased L3 size. This arrangement is somewhat awkward for regular users, but provides an excellent opportunity for us to isolate the effect of increased L3 capacity.
Here, we're testing the 7950X3D, which offers 16 cores across two CPU chiplets (CCDs). The first chiplet has 96 MB of L3 cache, while the second has a regular 32 MB L3 setup.
Clock Behavior
AMD's stacked cache is less tolerant of high voltage, preventing the VCache cores from boosting as high as their vanilla counterparts. This creates a tricky optimization problem, because one CCD is not clearly better than the other. If an application already sees high hitrates with a 32 MB L3, or has a hot working set so large that going to 96 MB of L3 makes minimal difference, it may perform better on the non-VCache CCD. Conversely, a large L3 hitrate increase from VCache could provide enough performance per clock gain to outweigh the second CCD's frequency advantage.
Addition latency is one cycle on virtually all CPUs, so we're using dependent register to register additions to measure each core's maximum boost clock. VCache cores generally stayed around the 5.2 GHz range, while normal cores could reach 5.5 GHz and beyond. The best core on the non-VCache CCD came within margin of error of the advertised 5.7 GHz boost clock.
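For reference, here's a minimal sketch of that clock measurement, assuming Linux, GCC or Clang, and a thread pinned to the core under test (e.g. with `taskset -c`). The idea is simply that a chain of serially dependent adds can't retire faster than one per cycle, so adds per second approximates core frequency.

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    const uint64_t iters = 2000000000ULL;
    struct timespec start, end;
    uint64_t a = 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iters; i++) {
        a += i;                          // serialized through 'a': one add per cycle
        __asm__ volatile("" : "+r"(a));  // keep the compiler from closed-forming the sum
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double sec = (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("~%.2f GHz (result: %llu)\n", iters / sec / 1e9, (unsigned long long)a);
    return 0;
}
```

Loop overhead (the increment and compare-and-branch) executes in parallel on a wide out-of-order core, so the dependent chain through `a` is what sets the pace.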
On average, the normal CCD clocks around 7% higher. For comparison, the regular 7950X had about a 3% clock difference between its two CCDs. While that's a lot of clock speed variation, it's better than the VCache clock variation on Zen 3. The best clocking Zen 3 VCache part, the Ryzen 7 5800X3D, had a 4.5 GHz boost clock. Without VCache, the top end Ryzen 9 5950X could reach 5.05 GHz, or around 10% higher clocks. Zen 4 enjoys higher absolute clock speeds too. Every single core on the VCache-enabled chiplet could clock comfortably above 5 GHz, providing a clock speed advantage over all Zen 3 SKUs.
As an interesting aside, the 7950X3D can clock higher than the 5800X3D even with a BIOS version that doesn't nominally support Zen 4's VCache. ASRock's site states that the B650 PG Lightning gets VCache support with BIOS version 1.18. Conveniently for me, my ASRock B650 PG Lightning board was happy to boot with the 7950X3D using an old BIOS version from the factory. Evidently it didn't fully support VCache, because VCache-enabled cores were capped at 4.63 GHz, while regular cores could only hit 4.78 GHz. Both speeds are still quite high in an absolute sense, and are a bit above the 5800X3D's 4.5 GHz boost clock. That said, a BIOS update is definitely recommended, even if the CPU boots out of the box.
VCache Latency
Zen 4's VCache implementation incurs a minor latency penalty. We can isolate cycle count differences by testing pointer chasing latency with boost off, and then multiplying the results by the 4.2 GHz base clock. A core with VCache takes about four extra cycles to get data from L3, which isn't a bad tradeoff for getting a much bigger L3 cache.
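A bare-bones version of that pointer chasing test might look like the following. This is a sketch, not our actual test harness: fill a buffer with a randomly ordered cycle of pointers, chase it, and convert nanoseconds to cycles at the 4.2 GHz base clock. The 16 MB test size is an assumption, chosen to fit in either CCD's L3 while spilling far out of the 1 MB L2.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    // 16 MB fits in either CCD's L3 but is far too big for the 1 MB L2.
    const size_t count = (16 * 1024 * 1024) / sizeof(void *);
    void **buf = malloc(count * sizeof(void *));
    size_t *order = malloc(count * sizeof(size_t));

    // Shuffle an ordering, then chain consecutive entries into one big
    // cycle. Random order defeats the hardware prefetchers.
    for (size_t i = 0; i < count; i++) order[i] = i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < count; i++)
        buf[order[i]] = &buf[order[(i + 1) % count]];

    const uint64_t hops = 100000000ULL;
    void **p = (void **)buf[0];
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < hops; i++)
        p = (void **)*p;                 // each load depends on the last
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = ((end.tv_sec - start.tv_sec) * 1e9 +
                 (end.tv_nsec - start.tv_nsec)) / (double)hops;
    // With boost off, cycles = nanoseconds * 4.2 GHz base clock.
    printf("%.2f ns, ~%.1f cycles at 4.2 GHz\n", ns, ns * 4.2);
    return p == NULL;                    // use p so the loop isn't optimized out
}
```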
The small cycle count difference doesn't tell the full story, because the true latency penalty is made worse by VCache's clock speed deficit. But taking 1.61 ns longer isn't too bad for getting triple the L3 capacity.
VCache Bandwidth
Just like Zen 3's VCache implementation, the stacked cache works by giving each L3 cache slice extra capacity. The on-CCD interconnect remains largely unchanged, with a bidirectional ring connecting cores to the cache slices. Therefore, we don't expect significant bandwidth differences beyond the hit from reduced clock speed.
Starting with a single core, the 7950X3D's non-VCache CCD takes an 11% L3 bandwidth lead, which is roughly in line with the clock speed difference. If we turn off boost and divide by the 4.2 GHz base clock to get bytes per cycle, we see nearly identical L3 bandwidth.
Loading all cores in the CCD shows that VCache affects clocks under heavy load too, even though voltages are lower than with single core boost. When reading from an L3-sized region, the VCache CCD sustained 4.8 GHz. The non-VCache CCD held 5.15 GHz.
Interestingly, the bandwidth difference is a bit larger than clock speed variation alone would suggest. With boost turned off and all cores running at 4.2 GHz, we can divide by clock frequency to get bandwidth in bytes per cycle.
VCache indeed has slightly lower bandwidth than its vanilla counterpart. Per core, we're averaging 18.45 bytes per cycle on the VCache CCD, compared to 20.8 on the regular one. I'm not sure why this is the case. I compared the results with performance counters by multiplying L3 hit count by 64 bytes, and the measured bandwidth lines up with what the performance counters report.
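The cross-check itself is simple arithmetic: every L3 hit delivers a 64 byte line, so

$$\text{bytes/cycle} = \frac{64 \times \text{L3 hit count}}{\text{cycles}}, \qquad 18.45\ \text{B/cycle} \times 4.2\ \text{GHz} \approx 77.5\ \text{GB/s per VCache core}$$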
L3 Hitrate Examples
Performance counters are useful for more than sanity-checking test results, because we can also use them to see how well VCache deals with various programs. Specifically, we'll be monitoring L3 hitrate from the L3 cache's perspective, not the core's perspective. That means we're counting speculative L3 accesses, and not distinguishing between prefetch and demand requests.
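As a sketch of how the metrics in this section fall out of the raw counts, the snippet below uses placeholder values rather than measurements. On Linux, the real inputs would come from perf; AMD parts expose the L3 counters through a dedicated L3 PMU.

```c
#include <stdio.h>

int main(void) {
    // Placeholder counter values, for illustration only.
    double instructions = 1.0e9;   // instructions retired
    double cycles       = 0.8e9;   // unhalted core cycles
    double l3_accesses  = 40.0e6;  // counted at the L3, so speculative and
    double l3_misses    = 9.0e6;   //   prefetch accesses are included

    printf("IPC:        %.2f\n", instructions / cycles);
    printf("L3 hitrate: %.1f%%\n",
           100.0 * (l3_accesses - l3_misses) / l3_accesses);
    printf("L3 MPKI:    %.2f\n", 1000.0 * l3_misses / instructions);
    return 0;
}
```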

Boost is disabled, keeping clocks capped at 4.2 GHz for consistency. The GPU used is an RDNA 2 based Radeon RX 6900 XT, which also had its clock speed set to 2 GHz for consistency. We're not going to be looking at absolute framerates here. Rather, performance counters will be used to look at IPC differences. Other sites already have plenty of data on how Zen 4's VCache performs across a variety of scenarios, so we won't rehash that.
GHPC
GHPC is an indie game that tries to be user-friendly while accurately depicting fire control systems on various armored vehicles. It also lets you remove turrets from T-72s with incredible visual effects.
This game can suffer from both CPU and GPU bottlenecks. It also doesn't load a lot of cores, which means it can be locked to one CCD without worrying about running out of cores to run the game's threads. Because visual effects (particularly smoke) are quite heavy on the GPU side, I'll be using IPC to isolate CPU performance.
VCache provides a notable 33% L3 hitrate increase here. Bringing average hitrate to 78% is more than enough to compensate for the slight L3 latency increase. GHPC enjoys a 9.67% IPC gain from running on the VCache CCD, so the other CCD should fall short even with its higher clock speed.
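A first-order sanity check bears that out. Treating performance as IPC times clock (which ignores how IPC itself improves at lower clocks, since memory latency costs fewer cycles):

$$\frac{\text{perf}_{\text{VCache}}}{\text{perf}_{\text{regular}}} \approx \frac{1.0967}{1.07} \approx 1.025$$

so the VCache CCD should hold a roughly 2-3% lead even after conceding its full 7% clock deficit.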
Cyberpunk 2077
We're using the built-in benchmark with raytracing set to ultra, because it's repeatable. The benchmark run ends up being GPU bound, partially because the GPU is set to consistent clocks. Cyberpunk only sees a 0.5 FPS gain from using VCache. But that's not the point here.
With affinity set to the VCache CCD, IPC increased from 1.26 to 1.43. That's a 13.4% increase, or basically a generational jump in performance per clock. VCache really turns in an excellent performance here. L3 hitrate with VCache is 63.74%, decent for a game but not the best in absolute terms. Therefore, there's still plenty of room for improvement. Modern CPUs have a lot of compute power, and DRAM performance is so far behind that a lot of that CPU capability is left on the table. Cyberpunk 2077 is a great demonstration of that.
DCS
Digital Combat Simulator (DCS) is a plane game. Here, we're running a scenario with a lot of planes, a lot of ships, and maybe a few missiles. The game is left in map view, and none of the vehicles are player-controlled, to keep things consistent.
The simulation is also sped up to increase CPU load. We're testing with the open beta's multithreading build, which really doesn't seem very multithreaded. During the simulation, only two to three cores were loaded.
VCache enjoys a higher hitrate again, as we'd expect, but this time the benefit is extremely small because DCS barely saw any L3 misses to begin with. Even without VCache, L3 misses per 1K instructions (MPKI) stood at just 0.35. VCache brings that down to 0.28, but there's not a lot of room to improve hitrate in this scenario. DCS actually ends up losing a bit of IPC, likely due to the increased L3 latency. Admittedly this is a very limited test that doesn't reflect actual gameplay, but it does show a scenario where cache latency can matter more than capacity.
COD Black Ops Cold War
COD (Call of Duty) Black Ops Cold War is an installment in the popular Call of Duty series. I'm testing it by playing rounds of Outbreak Collapse in Zombies mode, which has a lot of zombies running around the screen. This isn't a perfectly repeatable scenario because it involves multiplayer, so there will be more margin of error in this case.
Zen 4's normal 32 MB cache suffers heavily in this game, eating a staggering 8.66 MPKI while hitrates average under 50%. VCache mitigates the worst of these issues. Hitrate goes up by 47%, while IPC increases by over 19%.
COD also suffers from very low IPC compared to the other games tested above. When IPC is this low, you want to dive a bit deeper. Fortunately, Zen 4 has a new set of performance monitoring events designed to account for lost performance on a per-pipeline-slot basis. Zen 4's renamer is the narrowest part of the pipeline (just like the past few generations of Intel and AMD cores), so most of the events focus on lost slots at that stage.
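The accounting works on dispatch slots. Assuming Zen 4's six-wide renamer, each cycle offers six slots, and the bound fractions come out as:

$$\text{Frontend Bound} = \frac{\text{slots lost to frontend stalls}}{6 \times \text{cycles}}, \qquad \text{Backend Bound} = \frac{\text{slots lost to backend stalls}}{6 \times \text{cycles}}$$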

VCache makes Zen 4 less backend-bound, which makes sense because the execution units can be better fed with a higher cache hitrate. However, the core suffers heavily from frontend bottlenecks. VCache appears to have little effect on frontend performance, suggesting that most of the L3 hitrate gain came from catching more data-side misses.
libx264 Video Encoding
Gaming is fun, but it's not the only thing we do with computers. Software video encoding is a great way to get high compression efficiency, at the expense of encoding time. H264 is an extremely popular video codec, with widespread hardware decode support. Here, we're transcoding an Overwatch POTG clip using libx264's veryslow preset.
Hitrate improves by 16.75%, going from 61.5% to 72.8%. That's a measurable and significant hitrate increase, but like DCS, libx264 doesn't suffer a lot of L3 misses in the first place. It's not quite as extreme, at 1.48 L3 MPKI on the non-VCache CCD. But for comparison, Cyberpunk and GHPC saw 5 and 5.5 L3 MPKI respectively. We still see a 4.9% IPC gain, but that's not great when the regular CCD clocks 7% higher. Performance doesn't scale linearly with clock speed, mostly because memory access latency falls further behind as core clocks rise. But given libx264's low L3 miss rate, it'll probably come close.
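Running the same first-order estimate as in the GHPC section, and keeping in mind that it flatters the higher clocked CCD:

$$\frac{\text{perf}_{\text{VCache}}}{\text{perf}_{\text{regular}}} \approx \frac{1.049}{1.07} \approx 0.98$$

so the regular CCD should finish around 2% ahead, give or take the imperfect clock scaling.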
libx264 therefore ends up being a demonstration of why AMD is only putting VCache on one CCD. Aside from the cost benefits, the non-VCache CCD can provide better performance in some applications. Of course, the situation may be different in servers, where power limitations could prevent regular cores from clocking higher.
7-Zip File Compression
7-Zip is a very efficient file compression program. It comes with a built-in benchmark, but as before, we're going to be compressing a 2.67 GB file instead of using the benchmark. The built-in benchmark runs through both compression and decompression, and I don't care about the latter because it's fast enough. Compression takes a lot of CPU cycles, and is more worth looking at.
With affinity set to the VCache CCD, we see a 29.37% hitrate improvement. IPC increases by 9.75%, putting it in line with GHPC. This is an excellent showing for VCache, and demonstrates that increased caching capacity can benefit non-gaming workloads. However, AMD's default policy is to place regular applications on the higher clocked CCD. Users will have to manually set affinity if they have a program that benefits from VCache.
Comparing High Capacity Cache Setups
As noted before, VCache is not the first attempt at delivering huge caching capacity to consumer platforms. Years ago, Intel used an EDRAM chiplet to implement a 128 MB L4 cache on Haswell, Broadwell, and Skylake CPUs. This solution offered impressive capacity, but suffered from poor performance. Latency is extremely high at over 30 ns.
EDRAM bandwidth is similarly unimpressive, at around 50 GB/s. On one hand, 50 GB/s is better than what a dual channel DDR3 setup can offer. On the other, it's only a quarter of Broadwell's L3 bandwidth.
Compared to VCache, EDRAM suffers from higher latency because getting to the EDRAM controller involves another trip over the ring bus. Then, data has to be transferred over Intel's OPIO links, which are optimized more for low power than high performance. OPIO is also partially to blame for EDRAM's lower bandwidth. It basically has a full-duplex 64-bit interface running at 6.4 GT/s (on Haswell), providing 51.2 GB/s in each direction. In fairness, Zen 4's cross-die Infinity Fabric links barely do better, providing 64 GB/s of read bandwidth per CCD and half as much write bandwidth. But that's a sign that on-package traces aren't well positioned to handle L3 bandwidth requirements.
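That per-direction figure falls straight out of the link width and transfer rate:

$$8\ \text{bytes/transfer} \times 6.4\ \text{GT/s} = 51.2\ \text{GB/s per direction}$$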
With TSVs and hybrid bonding technology, AMD was able to get orders of magnitude better pin counts. Each nominally 4 MB L3 slice gets its own interface to an 8 MB extension, which comes complete with tags and LRU arrays. Zen 4's L3 cache controller probably determines whether data is cached on the stacked die or the base die by checking a subset of the address bits. Then, it checks a set of 16 tags retrieved either from the base die or the cache die, and returns data from the respective source if there's a hit. This tightly integrated extension to the L3 cache means the stacked die ends up servicing a lot of L3 requests, but TSVs make cross-die accesses cheap enough that it's not a problem.
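A conceptual sketch of that lookup flow might look like the code below. To be clear, this is a guess at the logic from observed behavior, not AMD documentation: the single die-select bit and the field layout are assumptions, and since the real hardware splits capacity 4 MB base to 8 MB stacked, the actual steering is likely more involved than one bit.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 16  // per the text above, a set of 16 tags is checked per lookup

typedef struct {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
} L3Set;

// Hypothetical per-slice state: base-die sets plus the stacked extension.
typedef struct {
    L3Set  *base_sets;      // nominal 4 MB slice on the core die
    L3Set  *stacked_sets;   // 8 MB extension on the cache die
    int     set_index_bits;
} L3Slice;

// Returns true on a hit. 64 B lines give offset bits [5:0], followed by the
// set index, followed by an assumed die-select bit.
bool l3_lookup(const L3Slice *s, uint64_t paddr) {
    uint64_t index = (paddr >> 6) & (((uint64_t)1 << s->set_index_bits) - 1);
    bool on_stacked = (paddr >> (6 + s->set_index_bits)) & 1;  // assumption
    const L3Set *set = on_stacked ? &s->stacked_sets[index]
                                  : &s->base_sets[index];
    uint64_t tag = paddr >> (6 + s->set_index_bits + 1);

    for (int w = 0; w < WAYS; w++)       // compare all 16 tags in the set
        if (set->valid[w] && set->tag[w] == tag)
            return true;                 // hit: data comes from that die
    return false;                        // miss: off to memory or another CCD
}
```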
Broadwell, in contrast, has to use EDRAM as an L4 because it's too slow to be tightly integrated into the L3. The L3 stays in place to insulate the cores from high EDRAM latency. To cut latency down somewhat, Broadwell uses some of the L3 cache's data array as tags for the L4. This scheme lets the L3 controller figure out whether a miss can be satisfied from EDRAM, without taking a trip across dies. For comparison, AMD's VCache die contains dedicated tag and LRU arrays, likely with specialized SRAM macros for higher performance.

Skylake uses a different scheme, where the EDRAM controller is placed in the system agent alongside dedicated tags. Somehow, Skylake's EDRAM is even worse than Broadwell's, with longer latency and nearly identical bandwidth. In any case, EDRAM is fundamentally less performant than the SRAM used in AMD's stacked cache. Just like other forms of DRAM, EDRAM requires refreshes and doesn't run particularly fast. The lower performance cross-die interface also dictates its use as an L4 cache, instead of as an extension of the L3.
Intel may be planning to bring back an L4 cache in Meteor Lake. A patch to the Linux kernel driver for Meteor Lake's iGPU suggests there is an L4, and Intel revealed at Hot Chips that Meteor Lake would use stacked dies. Advanced packaging technology could help this L4 achieve far higher bandwidth than EDRAM, but I'm worried that it's still an L4 cache. As a separate cache level, it'll probably be less tightly integrated and suffer lower performance than VCache. I suspect it'll be more iGPU oriented, helping to alleviate DRAM bandwidth bottlenecks while not doing so well on the latency front. GPUs are less sensitive to latency than CPUs, but high latency will reduce its potential on the CPU side.
Final Words
Zen 4's VCache implementation is an excellent follow-on to AMD's success in stacking cache on top of Zen 3. The tradeoff in L3 latency is very minor compared to the massive capacity increase, meaning that VCache can provide an absolute performance advantage in quite a few scenarios. Zen 4's larger L2 also puts it in a better position to tolerate the small latency penalty created by VCache, because the L2 satisfies more memory requests without having to incur L3 latency. The results speak for themselves. While we didn't test a lot of scenarios, VCache provided an IPC gain in every one of them. Sometimes, the extra caching capacity alone is enough to provide a generational leap in performance per clock, without any changes to the core architecture.
At the same time, VCache doesn't always provide an absolute performance advantage, largely because of the clock speed reduction. Therefore, AM5 doesn't have a clear top-of-the-line Zen 4 configuration. VCache configurations do well for programs that benefit from large cache capacity increases, while the more vanilla configuration is better for applications that like higher clocks. Of course, this puts consumers in a difficult situation. It's hard to tell what a program will benefit from without benchmarking it, and one user might need to run both kinds of applications on the same system.
Unfortunately, there's been a trend of uneven performance across cores as manufacturers try to make their chips best at everything, using different core setups to cover more bases. VCache is actually a very mild example of this. If you put the wrong application on the wrong core, you're probably not going to end up with an annoyingly large performance difference. Looking beyond AMD, Intel has been mixing up core setups as well. Alder Lake and Raptor Lake see even bigger performance differences across cores on a chip. Specialized architectures and core setups are going to become increasingly normalized as manufacturers try to chase down every last bit of performance.
If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.