Van Gogh, AMD’s Steam Deck APU – Chips and Cheese
Zen 2’s launch was a defining second for AMD. For the primary time in lots of, a few years, AMD’s single thread efficiency might go face to face with Intel’s greatest. Zen 2 additionally began a pattern the place AMD introduced as much as 16 cores to desktop CPUs, giving customers very robust multithreaded efficiency with out having to purchase HEDT platforms.
However Zen 2 was additionally versatile and did an excellent job of scaling all the way down to decrease energy targets. That was very true when Zen 2 cores had been carried out in additional energy environment friendly monolithic dies. Probably due to that, the Zen 2 structure went on to serve in new merchandise even after Zen 3’s late 2020 launch. In 2021, Lucienne merchandise just like the Ryzen 7 5700U used Zen 2 cores in an up to date platform. Mendocino launched in 2022 to focus on low energy laptops, with Zen 2 cores ported to TSMC’s 6 nm course of. Van Gogh is yet one more instance. It’s a somewhat distinctive product that mixes RDNA 2 with 4 Zen 2 cores on TSMC’s 7 nm course of.
Not like Lucienne and Medocino, Van Gogh isn’t a common goal laptop computer chip. It solely exhibits up within the Steam Deck, an ultraportable gaming console with a type issue that’s vaguely akin to Nintendo’s Change. In fact, the Steam Deck is supposed to run PC video games, which may be considerably extra demanding. Not like its Zen 2 cousins, the Van Gogh chip within the Steam Deck didn’t get a correct title. It as an alternative reviews itself as “AMD Customized APU 0405”.
Configuration
The Steam Deck has 16 GB of LPDDR5: two Samsung chips with 8 GB of capability every. They’re organized in 4 32-bit channels and run at 5500 MT/s, which ought to give 88 GB/s of theoretical bandwidth. The motherboard known as “Valve Jupiter”. It connects the APU to an x4 M.2 slot, and offers x1 PCIe hyperlinks to a micro-sd card controller and a Realtek 8822CE WiFI card.
Energy to the APU is supplied by three VRM levels, managed by a MP2845. Two of the levels are marked 8690 2823 B, whereas a 3rd is marked 8690 3000 C. The VRM is presumably cut up right into a two-stage part and a separate single-stage one. Subsequently, the VRM is somewhat weak, however that’s not an enormous deal contemplating the APU seems to be capped at 16 W. That energy funds is flexibly allotted between the CPU and GPU. For instance, a GPU-bound sequence might see the GPU pulling over 10 W, whereas the CPU aspect will get squished under base clock and attracts 2-3 W. The other applies for a CPU-only workload.
This kind of versatile energy allocation can work nicely when a sport is primarily CPU sure or GPU sure. But it surely does go away efficiency on the desk in case you’re making an attempt to make use of each the CPU and GPU collectively to maximise compute throughput. Sometimes, you see that occur with compute purposes, like renderers and photograph processing purposes. The Steam Deck doesn’t primarily goal that, so it ought to be superb except a sport pegs each the CPU and GPU on the identical time.
CPU Aspect
Van Gogh options 4 Zen 2 cores in a single cluster (CCX), with a 3.5 GHz enhance clock and a couple of.8 GHz base clock. I’m going to concentrate on system degree stuff right here, and canopy Zen 2’s core structure individually in one other article. From transient testing I didn’t see any variations from a traditional Zen 2 core. There’s no lower down FPU like on the PS5, for instance.
A core-to-core latency take a look at confirms that there’s a single CCD, with a quad core, eight thread configuration.
Caching
Like Renoir, Van Gogh’s CCX solely has 4 MB of L3 cache. Desktop and server Zen 2 variants function 16 MB of L3 cache per CCX, which helps insulate the cores from gradual reminiscence and customarily improves efficiency.
Valve is doing one thing humorous of their OS, as a result of the L3 is mainly lacking from a latency take a look at with default settings. Setting the scaling governor to efficiency somewhat than the default schedutil makes one thing resembling a L3 present up, however efficiency remains to be very poor. L3 efficiency is cheap below Home windows, indicating it’s not a defect within the APU.
Van Gogh’s L1 and L2 caches carry out simply as you’d anticipate from any Zen 2 CPU. Like with Renoir, we see 4 cycles of L1D latency, and 12 cycles of L2 latency. Each cell chips see an enormous L3 capability deficit in comparison with a excessive finish desktop Zen 2 implementation.
A small L3 hurts Renoir, however Van Gogh sees a particular degree of ache as a result of LPDDR5 latency is abysmal. Even servers lately don’t take 150 ns to get knowledge from reminiscence. Not like the L3, outcomes remained constantly poor even in Home windows 11. We’re most likely seeing a severe challenge with the reminiscence controller somewhat than OS energy saving weirdness.
The Ryzen 7 4800H laptop computer examined was geared up with DDR4-3200 22-22-22-52. These JEDEC timings aren’t as tight as what you would possibly discover on a typical desktop DDR4 package, however the 4800H nonetheless exhibits off a a lot better latency outcome than Van Gogh. I assume one thing needed to give for the L a part of LPDDR5.
Bandwidth
L3 points appear to be gone with a bandwidth take a look at, with related outcomes utilizing Home windows or Linux. We see over 200 GB/s of L3 bandwidth with an all-thread load. Bandwidth due to this fact appears to be like superb, even when it’s a bit decrease than that of different Zen 2 implementations due to clock pace variations.
Nonetheless, LPDDR5 once more turns in a disappointing efficiency. Actually, a wide range of elements imply that getting full theoretical bandwidth out of any DRAM configuration is a pipe dream. For instance, you’ll lose reminiscence controller cycles from read-to-write turnarounds and web page misses. However 25 GB/s is on the incorrect planet.
I anticipated higher efficiency out of a 128-bit LPDDR5-5500 setup. The chips themselves are rated for 6400 MT/s, which means that theoretical reminiscence bandwidth is completely wasted from the CPU aspect.
To place extra perspective into simply how unhealthy that is, Renoir’s DDR4-3200 setup beats Van Gogh’s by a large margin. That applies even once I used course of affinity to restrict my take a look at to a single CCX. 25 GB/s is one thing out of the early DDR4 days. For instance, a Core i5-6600K can pull 27 GB/s from a twin channel DDR4-2133 setup.
The LPDDR5 setup due to this fact saddles the CPU with rubbish reminiscence latency, whereas offering bandwidth on par with a DDR4 setup out of late 2015. It’s not an enormous step up from DDR3 setup both. All that’s made worse by the CPU’s small L3 cache, which suggests the cores are much less insulated from reminiscence than they’d be on a desktop or server Zen 2 implementation.
For much more perspective, we are able to take a look at reminiscence bandwidth utilization in Cyberpunk 2077. The sport was run with raytracing off, permitting framerates to hover round 100 FPS. I’m utilizing undocumented efficiency counters, however I’ve examined by pulling a recognized quantity of information from reminiscence and checking to verify counts are affordable.
Even desktop Zen 2 with 16 MB of L3 would discover itself needing greater than 25 GB/s. Much less L3 capability means even increased reminiscence bandwidth demand. Van Gogh is clearly not optimized to get probably the most out of its CPU cores. It’s beginning to really feel like a smaller console APU, the place engineering effort is targeted on the GPU aspect, somewhat than the CPU.
Clock Ramp Habits
CPUs don’t run at most clock on a regular basis, and that particularly applies for a cell machine. As an alternative, they enhance clocks in response to load. That clock ramping course of can take time. Typically most consumer gadgets go to most clock as quick as potential to ship excessive responsiveness. The Steam Deck doesn’t do that. Clock speeds begin at 1.4 GHz, after which attain 1.7 GHz in 0.27 milliseconds. That’s begin and exhibits the APU can command clock modifications fairly shortly. Nonetheless, it sits at 1.7 GHz for a whole bunch of milliseconds earlier than slowly stepping up clock speeds. It doesn’t attain most clocks till almost a second.
This type of enhance habits is horrible for a consumer machine. It’s going to really feel loads much less responsive than different Zen 2 techniques. For comparability, Renoir reaches its most enhance clock in 9.35 ms. Even older CPUs like Piledriver and Haswell can attain most clocks in lower than 100 ms, which makes me assume this enhance coverage is deliberate. Habits is equally unhealthy on each Home windows and SteamOS. Maybe Valve did so with a purpose to prolong battery life on the expense of responsiveness. In any case, the Steam Deck is designed for lengthy working duties like video games. Net shopping isn’t a major use case.
Van Gogh’s GPU
Valve’s Steam Deck is a gaming-first machine, so the GPU deserves consideration. Just like the chip as a complete, AMD didn’t give the GPU a elaborate title. When you question it from OpenCL or Vulkan, it tells you it’s the AMD Customized GPU 0405. That is apparently an RDNA 2 derived GPU with 512 FP32 lanes, or 4 WGPs. It runs at as much as 1.6 GHz, which is a really low clock pace for a RDNA 2 GPU. For perspective, the RX 6900 XT can run at 1.7 GHz with an extremely low 0.8 V core voltage. Low energy positively takes priority over absolute efficiency, and even hitting the structure’s effectivity candy spot. Normal software program assist was not a precedence both. To make my life onerous, the AMD Customized GPU 405 lacks OpenCL assist. So, I’m utilizing Nemes’s Vulkan-based take a look at suite.
The AMD Customized GPU 0405 has a RDNA model cache setup, which means it has an additional cache degree in comparison with Renoir’s Vega iGPU. There’s 16 KB first-level vector and scalar caches, backed by a 128 KB L1. Like Renoir, Van Gogh makes use of a disproportionately giant 1 MB L2 cache to insulate the GPU from DRAM. If we maintained the identical L2 cache to compute ratio as AMD’s RX 6900 XT, a GPU with 4 WGPs would have lower than 512 KB of L2.
RDNA’s architectural benefits are on full show. Vector reminiscence entry latency is a lot better than Vega’s. Vega is extra aggressive from the scalar aspect, with almost an identical scalar cache entry latency. However RDNA’s 128 KB L1 ought to nonetheless give it an edge as a result of each iGPUs have related L2 latency. Once more, that’s spectacular for Van Gogh as a result of it’s capable of preserve the identical L2 latency whereas getting the benefit of a 128 KB L1 mid-level cache.
DRAM latency is once more horrible. RDNA 2’s reminiscence latency usually compares nicely to GCN-based architectures. However LPDDR5 latency is a nonstop shitshow, and the GPU aspect isn’t immune. Fortunately, bandwidth is a lot better. With a GPU bandwidth take a look at, the LPDDR5 controller lastly redeems itself and achieves one thing near what it ought to be able to on paper. With over 70 GB/s of bandwidth, the Customized GPU 0405 will get a large bandwidth lead over Renoir’s iGPU. That strains up with Van Gogh being a gaming targeted product.
Van Gogh’s bandwidth benefit actually can’t be understated. Built-in GPUs usually endure from bandwidth limitations as a result of they’re constructed into chips that emphasize CPU efficiency. Trendy CPUs rely closely on cache and customarily don’t want plenty of reminiscence bandwidth to carry out nicely, at the very least till you get to very excessive core counts and specialised purposes. GPUs are one other story as a result of their working units are usually a lot bigger, usually making efficient caching tough. Worse, built-in GPUs need to battle with the CPU for reminiscence bandwidth.
GPU | Configuration | Reminiscence Bandwidth | Bandwidth Per FLOP (Bytes Per FLOP) |
AMD Customized GPU 0405 | 512 FP32 Lanes at 1.6 GHz, 1.64 TFLOPs | 128-bit LPDDR5-5500, 88 GB/s | 0.054 |
AMD Radeon 7 (Renoir) | 448 FP32 Lanes at 1.6 GHz, 1.43 TFLOPs | 128-bit DDR4-3200, 51.2 GB/s | 0.036 |
AMD Radeon RX 6900 XT | 5120 FP32 Lanes at 2.5 GHz, 25.6 TFLOPs | 256-bit GDDR6-16000, 512 GB/s | 0.02 |
Xbox Collection X GPU | 3328 FP32 Lanes at 1.8 GHz, 12.1 TFLOPs | 320-bit GDDR6-14000, 560 GB/s | 0.046 |
To counter this, the Steam Deck’s LPDDR5 setup offers compute-to-bandwidth ratios akin to that of consoles. Large reminiscence bandwidth means there’s no want for an Infinity Cache. It additional means the GPU ought to have loads of bandwidth out there even when it’s sharing a reminiscence bus with the CPU. Lastly, Van Gogh’s reminiscence bandwidth technique offers an fascinating distinction to desktop GPUs, that are utilizing bigger caches as an alternative of huge VRAM bandwidth. It appears to be like like DRAM expertise remains to be ok to feed the bandwidth calls for of small GPUs with out incurring huge energy prices.
If we lock desktop RDNA 2 to the identical clocks, we see Van Gogh behaving very equally proper as much as L2. There, Van Gogh’s smaller L2 is definitely sooner at matched clocks. Checking a smaller L2 might be simpler. A small GPU just like the one on Van Gogh additionally has fewer L2 purchasers, additional simplifying issues and permitting for latency optimizations.
Previous the L2, we are able to see desktop RDNA 2’s Infinity Cache. It’s prominently lacking on Van Gogh. However once more, Van Gogh depends on huge reminiscence bandwidth somewhat than caches, so it doesn’t want an additional degree of cache.
Compute Throughput
From a math throughput perspective, the Customized GPU 0405 is about the identical measurement as Renoir’s iGPU. The throughput distinction we see right here comes all the way down to the Ryzen 7 4800H solely having seven Vega compute items enabled out of eight.
RDNA 2 due to this fact has its execution items balanced in a really related method to Vega (and GCN). Much less frequent operations like FP32 divide, the rest, reciprocal, and inverse sq. root execute at quarter charge. FP64 throughput isn’t prioritized in both product, although Vega can do FP64 provides at a barely higher 1:8 charge. FP64 multiply and multiply-add run at nearer to 1:16 charge although, making these operations act like they do on RDNA 2.
Each architectures can get elevated FP16 throughput by packing two FP16 values right into a 32-bit register, and working packed operations on it. Often that lets FP16 operations execute at double charge. Unusually that wasn’t achieved on Renoir for FMA operations. Neither structure sees a considerable FP16 throughput benefit for particular operations. Maybe including further {hardware} for these operations wasn’t price it.
Integer throughput tells an analogous story. Once more ratios for the frequent operations haven’t modified. If it labored nicely in GCN, there’s no must fiddle with it for RDNA. Integer addition executes at full charge, whereas much less frequent operations like multiplies execute at quarter charge. Integer division and the rest operations get very low throughput.
Van Gogh does prolong a notable lead for 64-bit vector integer operations. I’m undecided the place that comes from as a result of each architectures do vector 64-bit operations by dealing with every 32-bit half individually. For instance, 64-bit addition could be carried out by doing an add-with-carry (v_add_co_u32) on vector registers holding the low half of the 64-bit values. The vector situation code (VCC) would maintain the carry flags, letting a subsequent v_add_co_ci_u32 (add with carry-in from VCC) produce an accurate outcome for the higher half.
CPU to GPU Hyperlink Bandwidth
Excessive LPDDR5 bandwidth has different makes use of too. DRAM acts as a backing retailer for transfers between the CPU and GPU. On an built-in GPU, extra DRAM bandwidth means sooner copies between CPU and GPU reminiscence.
Van Gogh sees glorious switch charges between the CPU and GPU. It’s loads sooner than Renoir, which is restricted by DDR4 bandwidth. And, it’s sooner than the RX 6900 XT, which is restricted by PCIe 4.0. Nonetheless, this spectacular efficiency gained’t be too necessary in a gaming platform. PCIe bandwidth doesn’t have a significant effect on gaming performance till you get to extraordinarily gradual configurations. It’s good to have for compute purposes that offload some work to the GPU, after which do some CPU-side processing on the outcomes earlier than the following iteration. However that’s not what Van Gogh is made for.
Ultimate Phrases
AMD’s Van Gogh mates Zen 2 cores with a contemporary RDNA 2 structure. On the floor, it’d appear like a compelling different to Renoir and Cezanne. These mainstream laptop computer chips are caught on AMD’s older Vega graphics structure, which is a pair generations behind their desktop playing cards. However a more in-depth look exhibits that Van Gogh isn’t a common goal chip. Moderately, Van tries to maximise GPU efficiency in a really tight energy envelope, and makes heavy sacrifices to the CPU aspect.
Van Gogh solely implements a single CCX, which means that multithreaded efficiency will endure in contrast Renoir’s two CCX configuration. Very gradual clock ramp insurance policies will affect responsiveness. The conservative 3.5 GHz max enhance clock makes it one of many slowest consumer Zen 2 implementations. DRAM efficiency from the CPU aspect is disappointing, with excessive latency and poor bandwidth.
The Customized GPU 0405 appears to be like higher, however isn’t freed from sacrifices. It doesn’t have the excessive clock speeds of desktop components. It isn’t a lot bigger than Renoir’s iGPU. It doesn’t have Infinity Cache. But it surely does get most of RDNA 2’s architectural benefits. And most significantly, Van Gogh has a ton of reminiscence bandwidth on faucet to feed that iGPU.
AMD’s Customized APU 0405 is due to this fact an fascinating instance of a really small console chip. Like its cousins within the PS5 and Xbox Collection X, the CPU will get hobbled by low clock speeds, small caches, and excessive reminiscence latency. However it’s a showcase of how RDNA 2 can scale all the way down to very low energy targets whereas sustaining good efficiency. Regardless that CPU efficiency is weak relative to different Zen 2 implementations, Van Gogh’s CPU efficiency in isolation is kind of credible. Zen 2 nonetheless brings in an enormous out-of-order engine with a number of superior architectural options. It ought to be worlds forward of the CPU present in Nintendo’s change, which runs weaker Cortex A57 cores at even decrease frequencies.
In that respect, it exhibits one in every of AMD’s major benefits in opposition to giants like Nvidia and Intel. Nvidia has formidable graphics architectures, however doesn’t have a equally robust CPU division. Intel is the opposite method round. AMD has each, letting it create robust APU implementations for merchandise just like the Steam Deck, Ps 5, and Xbox Collection X. In unhealthy instances, this energy stored AMD afloat till the corporate might bounce again. In higher instances, it makes AMD a robust competitor in area of interest markets, just like the one the Steam Deck fills.
When you like our articles and journalism and also you need to assist us in our endeavors then contemplate heading over to our Patreon or our PayPal if you wish to toss a couple of bucks our method or if you want to speak with the Chips and Cheese workers and the folks behind the scenes then contemplate becoming a member of our Discord.