Intel CPU Die Topology – by Jason Rahman

2023-05-27 23:44:29

Over the past 10-15 years, per-core throughput growth has slowed, and in response CPU designers have scaled up core counts and socket counts to continue increasing performance across generations of new CPU models. This scaling, however, is not free. When scaling a system to multiple sockets using NUMA (Non-Uniform Memory Access), software must often take the NUMA topology and locality into account.

However, there is a second level of locality present in newer, high core count systems which many overlook. To scale to higher core counts, these high core count CPUs implement a Network-On-Chip (NOC) interconnect within each physical CPU die and package, typically some form of mesh or ring bus, to communicate between CPU cores (or more precisely, between the L3 cache slices and the L2 + L1 caches associated with each core) on the same physical CPU die. Depending on the particular topology used for the network, the interconnect will have varying performance characteristics.

For most applications and CPUs, these performance differences will be negligible in practice for all but the most demanding workloads. However, they are detectable with microbenchmarks. Furthermore, certain CPU models do show noticeable performance impacts at the application level if the non-uniformity in communication costs between cores on a physical CPU package is neglected.

To start, we'll look at a pair of Intel CPUs and note a few key details about their on-die interconnect topology and its impact on cross-core communication patterns. In a subsequent post we'll take a look at AMD Rome and AMD Milan CPUs, and their much more nuanced hierarchy.

All measurements are performed on bare metal EC2 instances running in AWS. Bare metal whole-host instances were used to avoid interference from co-located workloads on the same host, given the sensitive nature of these measurements. You can find the tool used to perform these measurements on GitHub here: https://github.com/jrahman/cpu_core_rtt.
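As a rough illustration of how this kind of measurement works (a minimal sketch, not necessarily the exact approach the linked tool takes), two threads are pinned to the pair of logical CPUs under test and bounce a shared flag back and forth, with the round trips timed via the TSC:

```c
// Minimal core-to-core round trip sketch: two threads pinned to chosen CPUs
// ping-pong a shared atomic flag and time the round trips with rdtsc.
// Hypothetical example; the linked cpu_core_rtt tool may differ in detail.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS 100000

static _Atomic int flag = 0;   // 0: ping's turn, 1: pong's turn

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin_to_cpu(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                                    // wait for ping
        atomic_store_explicit(&flag, 0, memory_order_release);   // reply
    }
    return NULL;
}

int main(void) {
    int ping_cpu = 0, pong_cpu = 1;            // the pair of logical CPUs under test
    pthread_t t;
    pthread_create(&t, NULL, pong, &pong_cpu);
    pin_to_cpu(ping_cpu);

    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);   // ping
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                                    // wait for reply
    }
    uint64_t ticks = __rdtsc() - start;

    pthread_join(t, NULL);
    printf("avg round trip: %.1f ticks\n", (double)ticks / ITERS);
    return 0;
}
```

Running this for every (i, j) pair of logical CPUs produces the kind of RTT matrix visualized in the heatmaps below.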

The Intel CPUs we'll look at today are both monolithic dies with 24 to 32 cores per CPU socket. The diagram below shows the general layout for these dies:

A few points to note. The mesh interconnect is clearly shown in the diagram. Each CPU core (together with both Hyperthreads) has a Caching and Home Agent (CHA in Intel documentation). The L3 cache for the die is divided into multiple “slices”, with one slice attached to each CPU core. The L3 cache slice attached to a core is not exclusive to that particular CPU core. Rather, any CPU core on the same physical die has equal access to any L3 cache slice elsewhere on the die by sending messages over the mesh interconnect. The CHA is connected to the mesh interconnect as a network stop. Accesses to the L3 slice from other CPU cores are mediated through the Caching/Home Agent, as are outbound messages from the attached CPU core.
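One quick way to see this sharing from software is to ask Linux which CPUs share the last-level cache; on these monolithic dies the answer is every core on the socket. A minimal sketch, assuming a Linux sysfs layout where cache index3 is the L3:

```c
// Print the set of CPUs that share cpu0's last-level (L3) cache.
// On these monolithic Intel dies the list covers every core on the socket,
// reflecting that all L3 slices are reachable over the mesh.
// Assumes Linux sysfs and that index3 is the L3 on this machine.
#include <stdio.h>

int main(void) {
    const char *path =
        "/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list";
    FILE *f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char buf[256];
    if (fgets(buf, sizeof(buf), f))
        printf("CPUs sharing cpu0's L3: %s", buf);
    fclose(f);
    return 0;
}
```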

With that conceptual background out of the way, let's look at a couple of examples of the mesh interconnect's performance.

First up, our dual socket Cascade Lake system. This particular system is a c5.metal AWS instance with two 24 core (48 thread) Xeon Platinum 8275CL CPUs. The 24 core Cascade Lake SKUs running in AWS are based on the XCC die, nominally with 28 cores, but running with 4 of those cores fused off (likely for yield and TDP reasons).

The diagram above shows the mesh topology used in the Cascade Lake XCC die. The worst case for core-to-core communication would be an 18 hop round trip between CPU cores on opposite corners of the silicon die.

The heatmap clearly shows some key details. First, the NUMA cost is clearly apparent. The orange/purple/yellow squares represent pairs of cores on different NUMA nodes. There is at least a 5-10x penalty when communicating with a CPU core on a different NUMA node. And this is on a dual socket system with only a single NUMA hop. Higher NUMA factor systems can incur multiple node hops.

We also see the Hyperthreads show up as the blue diagonals. These sibling Hyperthreads communicate through a shared L1 cache, instead of communicating across the mesh interconnect within the CPU die, or across the UPI interconnect between CPU sockets.
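Which (i, j) cells should land on those blue diagonals can be read straight out of sysfs. A small sketch, assuming Linux and the 96 logical CPUs of the c5.metal host:

```c
// List the Hyperthread sibling set for each logical CPU, identifying the
// low-latency (shared L1) pairs that appear as diagonals in the heatmap.
// Assumes Linux sysfs; NCPUS matches the 96 logical CPUs of the c5.metal host.
#include <stdio.h>

#define NCPUS 96

int main(void) {
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        char path[128], buf[64];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                // CPU offline or not present
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d siblings: %s", cpu, buf);
        fclose(f);
    }
    return 0;
}
```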

The histogram shows a few different modes: a small peak at the far left edge comes from the low latency RTT between Hyperthreads on the same physical CPU core, while the primary peak around 200 ticks represents on-die communication through the mesh interconnect. The remaining peaks reflect cross NUMA node communication over the UPI cross-socket interconnect.

Next up, we have a dual socket Ice Lake system. This is an AWS c6i.metal host running dual Xeon Platinum 8375C CPUs, each with 32 physical CPU cores and 64 hyperthreads, for a total of 64 cores + 128 threads.

The latency heatmap already shows a very flat and consistent latency profile. Just like the Cascade Lake data, we observe the Hyperthread pairs showing up as partial diagonals, while the separate NUMA nodes show up clearly. Green squares reflect the RTT between cores on the same NUMA node, while longer RTTs between CPU cores running on different NUMA nodes show up as higher RTT purple squares.

The histogram above also shows a clear triple peak, revealing, from left to right (a rough tick-to-nanosecond conversion follows the list):

  1. Low hyperthread to hyperthread RTT around ~50 ticks

  2. Within-die mesh interconnect RTT around 175 ticks

  3. Cross NUMA node UPI interconnect RTT around 400 ticks
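For a rough sense of absolute scale, TSC ticks can be converted to nanoseconds using the invariant TSC frequency; the 2.9 GHz value below is an assumption matching the 8375C base clock, not something measured in this post:

```c
// Convert TSC tick counts to nanoseconds, assuming the TSC runs at the
// CPU's base clock (2.9 GHz here is an assumption matching the 8375C).
#include <stdio.h>

int main(void) {
    const double tsc_ghz = 2.9;                  // assumed TSC frequency, GHz
    const double ticks[] = {50.0, 175.0, 400.0}; // peaks read off the histogram
    for (int i = 0; i < 3; i++)
        printf("%6.1f ticks ~= %6.1f ns\n", ticks[i], ticks[i] / tsc_ghz);
    return 0;
}
```

At that assumed frequency the three peaks work out to roughly 17 ns, 60 ns, and 138 ns respectively.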

Compared to Cascade Lake, we see from both the histogram and the latency heatmap that Ice Lake has tightened up the latencies between cores within the same die. While some of the differences here may be measurement noise, it appears that after a few generations using a mesh interconnect (starting with Skylake), Intel has clearly been making solid generation over generation improvements.

We've seen the fairly flat latency profile which Intel's on-die mesh interconnect yields. While the histograms do reveal some amount of variance in communication cost, that variance is quite low, and newer generations have fairly predictable and flat latency profiles across the CPU die. For practical purposes, on Intel CPUs, nearly all software can treat the communication cost across cores as both uniform and low enough to generally not show up as a performance issue.

Next up we'll cover AMD CPUs, both Rome (Zen 2) and Milan (Zen 3), which (spoiler) do have noticeable variation in communication costs within a CPU package.
