CPU-Z’s Inadequate Benchmark – Chips and Cheese
CPU-Z is a hardware information tool from a company called CPUID, not to be confused with the CPUID instruction. Besides displaying basic CPU, motherboard, and memory information, CPU-Z features a built-in benchmark. While the benchmark isn’t its primary function, it has made its way into some reviews, as well as AMD’s slides. Its free, accessible nature means it naturally enters online discussions. Therefore, it’s worth investigating and understanding the CPU-Z benchmark.
I’ll be shoving CPU-Z’s benchmark through Intel’s Software Development Emulator to get an idea of what instructions it executes. Then, I’ll run the benchmark on several CPUs and use performance counters to evaluate how it challenges various CPU architectures. Since the benchmark doesn’t require AVX2, I’ll be running it on AMD’s older FX-8150 and Intel’s low-end Celeron J4125, in addition to the Core i7-7700K, Ryzen 3950X, and Ryzen 7950X3D. Most of the focus will be on the single-threaded benchmark, because the multi-threaded benchmark appears to simply run the same workload on each logical core without stressing any system-level resources.
Instruction Mix
CPU-Z’s benchmark is a FP32 math test using SSE instructions. It doesn’t leverage SSE’s vector math capability, apart from some 128-bit memory accesses. Most SSE instructions are scalar FP32 adds, multiplies, conversions, or compares.
The average instruction length is 4.85 bytes, thanks to SSE instructions and memory operands with large displacements.
The long average instruction length could mean frontend throughput gets limited by the 16 bytes per cycle of L1 instruction cache bandwidth on older Intel CPUs. However, that limitation can be mitigated by op caches, and is only an issue if the execution engine can reach high enough IPC for frontend throughput to matter.
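To put a number on that, here’s the back-of-the-envelope arithmetic as a quick Python sketch, using the 4.85-byte average measured above and the 16 bytes per cycle fetch figure for older Intel cores:

```python
# Fetch-bandwidth ceiling on IPC: if the frontend can only read 16
# instruction bytes per cycle, the average instruction length caps how
# many instructions per cycle it can possibly deliver.
FETCH_BYTES_PER_CYCLE = 16
AVG_INSN_LENGTH = 4.85  # bytes, measured for CPU-Z's benchmark

max_ipc_from_fetch = FETCH_BYTES_PER_CYCLE / AVG_INSN_LENGTH
print(f"fetch-limited IPC ceiling: {max_ipc_from_fetch:.2f}")
# prints: fetch-limited IPC ceiling: 3.30
```

That ceiling of roughly 3.3 IPC only bites if the core could otherwise sustain more than that, which is why an op cache or a modest achievable IPC makes the limit moot.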
The CPU-Z benchmark has a typical mix of memory accesses. Branches are less common in CPU-Z than in games, compression, and Cinebench 2024.
Performance Overview
CPU-Z is a moderate IPC benchmark. All of the CPUs here achieve a bit below half their core width. Bulldozer is an exception and struggles.
AMD’s Zen 4 and Intel’s recent CPUs can account for lost throughput in terms of pipeline slots. I’m using the methodology described in AMD’s PPR for Zen 4, and VTune’s output for Kaby Lake. For Goldmont Plus, I’m using the ISSUE_SLOTS_NOT_CONSUMED and UOPS_NOT_DELIVERED events. Intel and AMD define Bad Speculation differently: AMD counts non-retired instructions fetched by the frontend, while Intel starts counting at the later rename stage. Bad speculation rates will therefore appear higher on Zen 4.
Goldmont Plus is a bit different because it can’t recover quickly from mispredicts. Like AMD’s Athlon, it has to wait for a mispredicted branch to retire before it can figure out the known-good state and continue. Cycles spent waiting for that are labelled “recovery.”
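For readers who haven’t seen this style of accounting, here’s a minimal sketch of the arithmetic. The counter values are invented for illustration; only the structure follows the top-down approach, where each cycle offers a fixed number of pipeline slots and every slot is either used to retire a micro-op or attributed to a loss category.

```python
# Toy top-down pipeline-slot accounting (illustrative numbers, not
# measured). Total slots = pipeline width * cycles. Slots not used for
# retirement, lost at the frontend, or wasted on bad speculation are
# attributed to the backend.
def topdown(width, cycles, retired_ops, frontend_lost_slots, bad_spec_slots):
    total_slots = width * cycles
    retiring = retired_ops / total_slots
    frontend_bound = frontend_lost_slots / total_slots
    bad_speculation = bad_spec_slots / total_slots
    backend_bound = 1.0 - retiring - frontend_bound - bad_speculation
    return {"retiring": retiring, "frontend": frontend_bound,
            "bad_spec": bad_speculation, "backend": backend_bound}

breakdown = topdown(width=6, cycles=1_000_000, retired_ops=2_400_000,
                    frontend_lost_slots=600_000, bad_spec_slots=120_000)
# retiring 40%, frontend 10%, bad speculation 2%, backend 48%
```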
All three CPUs show similar high-level behavior. Core throughput is good, with the backend facing the biggest challenge. The frontend keeps the core well-fed without major problems, but there’s a bit of room for improvement.
Backend
The backend gets all the fun, so let’s start there. Most of the time, the backend can’t retire an operation because it’s waiting for a computation to complete. That’s the opposite of what we saw in games and Cinebench 2024, both of which are overwhelmingly memory bound.
A quick peek at cache hitrates shows us why. CPU-Z’s benchmark works on very little data. It fits within the 32 KB L1 data cache on both CPUs above.
Even Bulldozer’s teeny tiny 16 KB L1D enjoys a 99.9% hitrate and sees fewer than one miss per thousand instructions. Call of Duty Cold War misses a 96 MB cache more often than CPU-Z misses in a 16 KB one. CPU-Z’s multithreaded mode doesn’t change the picture either, because the working set for two threads is still small enough to allow over 99% hitrates. For example, Kaby Lake sees its L1D hitrate drop from 99.97% to 99.78%, for a whopping 0.2% difference between the single-threaded and multithreaded portions of CPU-Z’s benchmark. If that’s not margin of error, I don’t know what is.
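The hitrate and miss-rate figures in this section come out of simple counter arithmetic. A sketch, with invented counter values chosen to land near the Kaby Lake numbers above:

```python
# L1D hitrate and misses per kilo-instruction (MPKI) from raw counter
# values. The inputs are illustrative, not measured.
def l1d_stats(accesses, misses, instructions):
    hitrate_pct = 100.0 * (accesses - misses) / accesses
    mpki = 1000.0 * misses / instructions
    return hitrate_pct, mpki

hitrate, mpki = l1d_stats(accesses=400_000_000, misses=120_000,
                          instructions=1_000_000_000)
# hitrate ≈ 99.97%, mpki = 0.12: well under one miss per thousand instructions
```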
I’m going to stop the data-side memory footprint analysis here, because the L2 cache, L3 cache, and DRAM setup don’t come into the picture. The “memory bound” cycles are just loads held up on data cache latency, which the out-of-order engine should easily absorb.
Out of Order Execution
Usually, an out-of-order engine’s biggest challenge is access to slower cache levels and memory. An L3 access can take 40-50 cycles, while a DRAM access can take hundreds. With CPU-Z, we’re mostly facing down FP execution latency, which typically ranges from 3 to 5 cycles. CPUs have to move past in-progress or stalled FP instructions to find independent instructions.
Starting with the oldest CPU in the mix, Bulldozer brings a large 60-entry FP scheduler meant to serve two threads. With the sibling thread idle, the shared FPU provides overkill FP scheduling capacity for the active thread. Stalls with a full FP scheduler still occasionally happen, suggesting CPU-Z has very long FP dependency chains. However, there are plenty of independent instructions among the dependent ones. Completed independent instructions can leave the scheduler, but have to be tracked in the reorder buffer until their results can be made final (retired).
Bulldozer has a 128-entry reorder buffer for each thread, which fills often and limits how far ahead the core can look to extract parallelism. FP port utilization on Bulldozer is low even though CPU-Z consists primarily of FP instructions. Bulldozer’s FP scheduler had something in it over 90% of the time, so FP operations look latency bound.
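A toy model shows why a mostly-full scheduler combined with low port utilization points at latency. Suppose the out-of-order engine has uncovered K independent dependency chains: each chain can only issue one operation every L cycles, so steady-state IPC is capped at K/L until the port count becomes the limit. The numbers below are illustrative, not measured:

```python
# Steady-state IPC of K independent FP dependency chains, each with
# L cycles of latency per operation, on a machine with P FP ports.
def fp_ipc_limit(chains, latency, ports):
    return min(chains / latency, ports)

# Bulldozer-like 5-cycle FP latency: four chains only sustain 0.8 IPC,
# leaving the FP ports mostly idle even though the scheduler stays busy.
assert fp_ipc_limit(chains=4, latency=5, ports=2) == 0.8
# Cutting latency to 4 cycles (Kaby Lake-like) speeds up the same code.
assert fp_ipc_limit(chains=4, latency=4, ports=2) == 1.0
```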
Kaby Lake is a Skylake derivative, making it a good representation of five generations of Intel CPUs. It’s an especially good representation for this benchmark, since L3 cache and memory controller improvements don’t matter. A large 58-entry scheduler handles both scalar integer and FP operations. As in Cinebench 2024, that’s a liability because the core can’t bring as much total scheduling capacity to bear.
However, Kaby Lake can keep more independent operations in flight with its 224-entry reorder buffer. Basic FP operations on Kaby Lake typically have 4-cycle latency compared to 5 cycles on Bulldozer, so Kaby Lake can get through those FP dependency chains faster too.
Kaby Lake’s port utilization is in a good place, with heavy but not overwhelming load on the two FP pipes. Execution latency is more of a problem than port count, even though Kaby Lake has fewer FP ports than contemporary AMD CPUs.
AMD’s Zen 2 architecture competed with Skylake derivatives. It features an even better 3-cycle latency for FP adds and multiplies. The FPU has a 36-entry scheduler and a 64-entry non-scheduling queue in case the scheduler fills up. That lets it track 100 incomplete FP operations before stalling the renamer, moving the bottleneck to the 224-entry ROB.
FP port utilization is in a good place for Zen 2. The FP add pipes see slightly higher load, likely because the FP2 pipe is shared with FP stores.
Zen 4 is AMD’s current generation architecture. It can keep nearly 128 incomplete FP operations in flight thanks to an improved 2×32 FP scheduler layout and a 64-entry non-scheduling queue in front of it. We still see some FP scheduler stalls, though. That counter only increments when the non-scheduling queue fills as well, so CPU-Z’s FP dependency chains must be very long.
Instead of the ROB filling as on Zen 2 and Bulldozer, Zen 4 runs out of FP register file capacity. That suggests many of the independent instructions found among the dependent ones are floating point operations. Zen 4 has a 192-entry FP register file. It’s larger than the 160-entry one in Zen 2 and Zen 3, but from AMD’s slides, it seems that didn’t help much.
Like the other CPUs, Zen 4 isn’t bound by ports. Utilization is lower than on Zen 2, thanks to the dedicated FStore pipes taking load off the FP add pipes. Zen 2 put floating point and vector stores on one of the FP add pipes, resulting in higher utilization.
From this data, I suspect FP execution latency matters most for CPU-Z. Then, the out-of-order execution engine needs enough scheduler capacity to get past long dependency chains and search for instruction-level parallelism. Finally, the core needs a big reorder buffer and large register files to hold results for all those in-flight instructions.
Frontend
A CPU’s frontend is responsible for bringing instructions into the core. It’s not particularly challenged by CPU-Z’s benchmark, but it’s good to understand why. In short, the instruction footprint is tiny. Recent CPUs have an op cache that holds decoded instructions in the CPU’s internal format (micro-ops). However, op caches have to be small because micro-ops are often larger than x86 instructions. CPUs still have capable decoders because op caches are designed for speed rather than caching capacity. Engineers therefore expect code to spill out of the op cache quite often. But CPU-Z doesn’t.
Kaby Lake, Zen 2, and Zen 4 have micro-op caches. All three deliver more than 90% of micro-ops from their op caches, so the decoders rarely come into play. That’s great for Kaby Lake, which would otherwise have to feed a four-wide decoder with 16 bytes per cycle of instruction cache bandwidth. Recall that the average instruction length is 4.85 bytes, so Kaby Lake’s instruction cache only has enough bandwidth to sustain 3.3 IPC. Zen 4’s larger op cache is overkill, because Zen 2’s 4096-entry one is already large enough to contain the entire benchmark.
The multithreaded portion of CPU-Z’s benchmark could present more of a challenge, because it makes a core’s two SMT threads fight for micro-op cache capacity. Intel comes off worst here because it statically partitions the micro-op cache. With 768 entries available per thread, micro-op cache coverage drops to 84.16%. That’s still higher than what Intel expected for typical applications when Sandy Bridge launched 12 years ago. And I’m fairly sure Intel evaluated the best case, where the other SMT thread is either idle or SMT is off entirely.
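Op cache coverage is just the share of micro-ops delivered from the op cache rather than the legacy decode path. A sketch of the arithmetic, with raw counts invented to reproduce the 84.16% figure above:

```python
# Micro-op cache coverage: fraction of micro-ops that came from the op
# cache instead of the decoders. Counter values are illustrative.
def op_cache_coverage_pct(ops_from_op_cache, ops_from_decoder):
    total = ops_from_op_cache + ops_from_decoder
    return 100.0 * ops_from_op_cache / total

coverage = op_cache_coverage_pct(ops_from_op_cache=84_160,
                                 ops_from_decoder=15_840)
# coverage = 84.16%
```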
AMD uses physical addresses to access the op cache, so entries can be shared across multiple threads running the same code. As a result, Zen 2 and Zen 4 barely take any hitrate reduction when CPU-Z’s benchmark switches to multithreaded mode.
For CPUs that don’t have an op cache, the L1 instruction cache comfortably holds CPU-Z’s code. Everyone gets around a 99.9% hitrate.
With nearly zero instruction cache misses, CPU-Z doesn’t challenge the frontend on any CPU. Thus, the challenge moves from feeding the core with instructions to executing those instructions. Running two threads in a core barely changes things. For example, Bulldozer’s L1i hitrate drops from 99.88% to 99.65% when two threads are running in a module, a difference of less than 1%.
Branch Prediction
A branch predictor’s job is to follow the instruction stream. Compared to Cinebench and gaming workloads, CPU-Z has fewer branches, which are also easier to predict. Bulldozer has a harder time, but mispredicts are a minor problem even on that old core. Even Zen 4 suffers more mispredicts per instruction in games.
Intel’s and AMD’s more modern cores have similar branch prediction accuracy. AMD has invested a lot of die area into more advanced branch predictors, but gains little in the CPU-Z benchmark. That suggests most branches are easily predictable, while a small minority either require very long history or aren’t seen often enough for the predictor to train on them. Sharing the branch predictor between two threads in CPU-Z’s multithreaded benchmark mode slightly reduces branch predictor accuracy, but it’s still nowhere near as difficult as Cinebench 2024 or games.
Branch Footprint
The whole point of branch predictors is to make branches fast. Part of that involves remembering where commonly seen branches go, using a cache of branch targets called a branch target buffer (BTB). Getting the branch destination from the BTB lets the predictor tell the frontend where to go, without waiting to see the branch instruction itself and calculate the destination from there.
CPU-Z has fewer unique branches than Cinebench 2024 or games. Even Goldmont Plus has no problem tracking CPU-Z’s branches, even though its BTB is the smallest of the bunch.
High performance CPUs have 4096 or more BTB entries to cope with larger branch footprints, but CPU-Z doesn’t need that. Even Zen 4’s 1.5K-entry L1 BTB is large enough.
Final Words
Benchmarking is hard. No benchmark can represent the broad range of applications that users will run. For example, Cinebench can’t accurately reflect a gaming workload. However, the primary challenges facing modern workloads are branch prediction and memory accesses, and many benchmarks do present those challenges.
What limits computer performance today is predictability, and the two big ones are instruction/branch predictability, and data locality.
Jim Keller, during an interview with Dr. Ian Cutress
That’s not just Jim Keller’s opinion. I’ve watched CPU performance counters during my day-to-day workloads. Across code compilation, image editing, video encoding, and gaming, I can’t think of anything that fits within the L1 cache and barely challenges the branch predictor. CPU-Z’s benchmark is an exception. The factors that limit performance in CPU-Z are very different from those in typical real-life workloads.
From AMD’s slides, Zen 4 barely improves over Zen 3 in CPU-Z. AMD’s architects likely saw that changes that would benefit CPU-Z wouldn’t pay off in other applications. Zen 4 got improvements like a larger micro-op cache, better branch prediction, and doubled L2 cache capacity. Those would help a lot of applications, but not CPU-Z. Thus, CPU-Z’s benchmark ends up being useless to both CPU designers and end users.
I hope CPUID revises its benchmark to bring it more in line with common workloads. They should run typical desktop applications and monitor performance counters to see what typical branch and memory footprints look like. That kind of data should be used to inform benchmark design for future CPU-Z versions.
If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.