Caching and Conclusion – Chips and Cheese
In Part 1, we looked at Bulldozer's core architecture. However, the core itself isn't the whole story. Memory advances haven't kept up with CPU speed, so modern CPUs deal with increasingly sophisticated caching setups. They have to deal with cache latency as well, because high memory latency drives larger caches, and it's hard to make a cache both large and fast.
To avoid memory bottlenecks, Bulldozer uses a triple level cache hierarchy like many other modern CPUs.
Cache and Memory Access
Bulldozer makes large cache hierarchy changes compared to K10. Latencies are generally higher, but Bulldozer does have a huge amount of on-chip cache.
| | Bulldozer | Sandy Bridge | K10 |
|---|---|---|---|
| L1 Data Cache | 16 KB 4-way, write-through, parity protected | 32 KB 8-way, write-back | 64 KB 2-way, write-back, ECC protected (tags parity protected?) |
| Write Coalescing Cache | 4 KB, ECC protected | N/A | N/A |
| L2 Cache | 2 MB 16-way, ECC protected | 256 KB 8-way | 512 KB 8-way (varies), ECC protected |
| L3 Cache | 8 MB 64-way, ECC protected | Varies, typically 2 MB × core count | 6 MB 48-way, ECC protected |
| Data TLBs | L1: 32 entry fully associative, L2: 1024 entry 8-way, parity protected | L1: 64 entry, L2: 512 entry | L1: 48 entry, L2: 512 entry 4-way, parity protected |
L1 Data Cache
AMD originally wanted to implement a 64 KB data cache similar to the one in K10. But that wouldn't fit, and ended up sticking out of the module's sides when they tried. L1D size wound up getting cut to 16 KB, and 3 cycle latency was dropped in favor of making it 4 cycles. Even with a 16 KB cache, AMD had to make concessions to hit their clock speed targets. From AMD's ISSCC presentation, Bulldozer's L1D switched to using 8T, 0.294 μm² bitcells, compared to the 6T ones used in 45 nm K10. 8T means that eight transistors are used to store each bit, reducing storage density. But AMD needed the 8T bitcells to improve performance at lower voltages, increase clocks, and reduce power draw. This change is an example of why transistor density is a poor metric. 8T SRAM would result in higher transistor counts and possibly higher transistor density, but less efficient use of die area. These issues seem related to teething problems on the 32 nm process, because AMD also had to use 8T SRAM for Llano's L1D caches when they shrank K10 to 32 nm. Like Llano, Bulldozer's L1D uses eight bitcells per bitline. In fairness, most modern designs also use 8T SRAM for their first level caches, so the change is at least partially down to the challenges of moving to denser nodes.
To deal with the capacity reduction, AMD increased the L1D's associativity to 4-way. That's unlikely to make up for the massive capacity cut, but it's better than nothing. The smaller capacity also affected the banking scheme, because building a L1D with fewer banks would likely lead to more bank conflicts. K10 and Bulldozer both use banking to let the L1D handle two simultaneous accesses per cycle. To avoid ending up with fewer banks and more bank conflicts, Bulldozer uses a micro-banking scheme to create 16 logical banks across the L1D, similar to K10's 16 banks.
AMD also took measures to save power. Bulldozer can use a single L1D read to service both L1D ports, if both accesses read the same 16 byte sector. The TLB can likewise avoid redundant reads if accesses hit the same page, which also happens very often. 4-way associativity means that four tags have to be checked, but AMD uses way-prediction to guess which L1D way will have the requested data. By doing so, it avoids checking four full tags on every access. Bulldozer therefore has a very advanced L1D design, but that's overshadowed by failing at the fundamentals. The cache simply doesn't have enough capacity, and hitrate suffers as a result. In some cases, Bulldozer can suffer more than twice as many L1D misses per instruction.
In terms of read bandwidth, the L1D at least maintains performance parity with K10. The cache can service two 128-bit loads per cycle. In theory, each module has two independent L1D caches. However, the FPU can only accept a pair of loads per cycle, and therefore caps a module's maximum load bandwidth at 2×128-bit.
Unlike read bandwidth, write bandwidth is rather poor because AMD made the L1D write-through.
Write-Through L1D, and WCC
Previously on Chips and Cheese, we compared Netburst's write-through L1D to skydiving without a parachute. In the same way that a skydiving setup is simpler if you don't have to care about a parachute, a write-through L1D is easier to implement because the cache never has to hold the only copy of a modified line. That means CPU designers can opt for cheaper parity protection instead of ECC, and don't have to worry about doing write-backs later. While exhaustive double-blinded studies weren't carried out at large scale, it's generally accepted that skydiving without a parachute could lead to non-ideal health outcomes. Similarly, Intel found that a write-through L1D was non-ideal for performance, and didn't do it again.
AMD's engineers aren't stupid, and learned from Intel's mistake. Instead of writing through to a slow-as-shit L2, Bulldozer's L1D is write-through to a 4 KB "Write Coalescing Cache" (WCC). The WCC is 4-way set associative, and acts as a write-back cache between the core and L2. Instead of skydiving without a parachute, you can think of this as skydiving with a very small parachute. You might break both legs and various other body parts. But medical literature suggests that waking up in the hospital is an improvement over not waking up at all.
Still, the WCC isn't great. Write bandwidth to it ends up being very mediocre at just over 10 bytes per cycle. Phenom and Sandy Bridge can both sustain a 16 byte store every cycle, limited by the single write port on their caches. Capacity is also a problem. 4K is not a lot of capacity, especially when it's shared by two threads. After WCC capacity is exceeded, L2 write bandwidth is quite low at just over 5 bytes per cycle on average. Running two threads in the module doesn't improve write bandwidth.
Bulldozer's L1D ends up being a significant weak point in the design. Dropping to 4 cycle latency makes the L1D slower than the one in Phenom even after accounting for clock speed increases. At the same time, the dramatic reduction in capacity means the core will run into L2 latency more often, and place higher demands on L2 bandwidth.
L2 Cache
Bulldozer's 2 MB, 16-way L2 cache is physically implemented with 0.258 μm² 6T SRAM bitcells. Compared to the 0.275 μm² 6T SRAM bitcells used in Westmere's L2 cache, AMD's bitcells are actually a bit denser. These are organized into 128 KB slices, each with eight 16 KB macros. The slices are then organized into four banks. Internally, the L2 cache has a 6 cycle pipeline. We measured 20 cycles of load-to-use latency, which is 16 cycles beyond the 4-cycle L1D. Thinking about the other 10 cycles is a fun exercise. I'm guessing a lot of them are spent on queuing and transit.
Compared to K10, Bulldozer's L2 is larger but slower. That's a pretty reasonable tradeoff, because hitrate with a 2 MB cache should be very high. Sandy Bridge's L2 takes the opposite approach, offering very low latency but only 256 KB of capacity.
In terms of bandwidth, Bulldozer's L2 is a notable improvement over K10. A single thread gets slightly better bandwidth per clock, and a clock speed bump gives the FX-8150 a 34% L2 bandwidth advantage over the Phenom X6 1100T. Sandy Bridge's L2 can give a single thread even higher bandwidth, but is much smaller.
When running two threads in the module, Bulldozer can provide twice as much L2 bandwidth. That suggests the L2 has separate paths to each thread's load/store unit, and lines up with AMD's publications.
The L2 can thus provide more bandwidth than Sandy Bridge's, while being eight times larger. I'm a bit sad that the L2 can't deliver 32 bytes per cycle to a single thread, but perhaps that was an easier way to provide high L2 bandwidth for a multithreaded load. To AMD's credit, Bulldozer's L2 does provide an impressive combination of caching capacity and bandwidth, especially if each module is working on a private set of data.
Bulldozer's L2 is something of a bright spot in the architecture. Its size advantage can't be overstated, and will dramatically reduce traffic on the chip's interconnect. Keeping traffic within the module and off the L3 is important, because memory performance goes off a cliff after a L2 miss.
L3 Cache, and Interconnect
Bulldozer implements an 8 MB, 64-way set associative L3 cache shared by all cores on the die. While the L3 is large, 8 MB only matches the combined capacity of all the L2 caches on the die. To make the L3 effective, AMD tries to keep it mostly exclusive of the module-private caches. Specifically, data is removed from the L3 whenever a module brings it into its own caches, unless the data is likely to be accessed by multiple modules. The L3 also functions as a victim cache, which means the only way for data to get into the L3 is by getting evicted from an L2.
L3 latency is terrible at over 18 ns. That's massively higher than Sandy Bridge EP's 12.7 ns L3 latency. Bulldozer manages to regress compared to Phenom, even compared to Phenom SKUs with lower Northbridge clocks than the FX-8150. For example, the Phenom X4 945 has a 2 GHz Northbridge clock, and 16.35 ns of L3 latency. L3 associativity is a definite culprit here.
From Phenom II to Bulldozer, AMD increased L3 capacity to 8 MB by keeping the same number of sets and increasing associativity from 48-way to 64-way. 64-way associativity means that any location in memory can go into 64 possible slots in the cache. Higher associativity reduces the chance of useful lines getting kicked out because of restrictions on where certain memory locations can be cached (conflict misses). But the downside is that every way has to be checked to determine whether there's a hit. Doing 64 tag and state checks doesn't sound cheap at all. I suspect AMD had to lengthen the L3 pipeline to make that happen, increasing latency. But there's more to the latency story. As the graph above shows, AMD suffered from relatively high L3 latency across both Phenom and Bulldozer.
AMD's Northbridge architecture deserves some discussion here. The Northbridge is AMD's equivalent of Intel's "Uncore", and traces its ancestry back to K8. In K8, more commonly known as Athlon 64, the Northbridge focused on integrating the memory controller onto the die, and enabling more efficient cross-socket connectivity. At that point, the Northbridge's latency and bandwidth capabilities weren't too important. DRAM bandwidth would be measured in the low tens of GB/s. IO and cross-socket bandwidth would be lower, and core-to-core traffic would be lower still. And latency for all of those would be very high compared to cache accesses.
K10 changed things by adding a L3 cache attached to the Northbridge. Because the Northbridge acted as a central hub between all of the cores, it was a natural place to attach a shared cache. But a cache obviously has much lower latency and higher bandwidth than DRAM. Suddenly, the Northbridge had a lot more to deal with, even though its architecture was largely unchanged. It ran at much lower clocks than the CPU cores, and tried to handle requests from all CPU cores through a centralized queue.
Bulldozer keeps the same Northbridge architecture. While the L3 appears to be physically split into four 2 MB sections, all of it has to be accessed behind the centralized Northbridge. In Sandy Bridge, the L3 slices are placed on a ring interconnect, with cache controllers at each ring stop. Intel's L3 therefore acts as a multi-ported, banked cache capable of handling multiple accesses per cycle, giving Sandy Bridge a massive bandwidth advantage.
But in fairness to AMD, they weren't alone in sticking a L3 behind a centralized crossbar. Intel did the same with Nehalem. We tested Intel's Xeon X5650, which is a die shrink of Nehalem to Intel's 32 nm process. Its L3 latency isn't great at over 15 ns, but bandwidth is not as terrible as Bulldozer's. Northbridge or Uncore clock is certainly a variable. Westmere's Uncore clock is unknown, but it's unlikely to be more than twice as high as the FX-8150's 2.2 GHz Northbridge clock.
Instead, I think Bulldozer's victim cache operation is responsible for its read bandwidth deficit. The L3 is only filled by L2 evictions, which means those lines are always written back to L3, regardless of whether they were modified. That can double cache traffic, because the L3 is usually handling a writeback from L2 alongside a lookup request. In contrast, Intel's L3 only receives a victim line if the core wrote to it.
Therefore, when the read bandwidth test sees 35 GB/s on Bulldozer, the L3 is actually dealing with 70 GB/s of traffic. If we track L3 hits with performance counters, we see about 35 GB/s. If we include counts from Northbridge event 0x4E2, which tracks L3 fills caused by L2 evictions, we get double that. We can see the full 70 GB/s from software if we use a read-modify-write pattern, in which case the writebacks (copy backs?) are useful to software. But such patterns seem to be more the exception than the rule in practice.
That puts Bulldozer's L3 bandwidth in a very bad place. A single module on the FX-8150 can pull around 15 GB/s from L3. Running two threads in the module slightly increases bandwidth, but not by much. It's a slight improvement over the Phenom X6 1100T, where one core can pull 12-14 GB/s from L3, but miles away from Sandy Bridge. The L3 looks even worse when we consider how close it is to DRAM bandwidth. A single module can pull about 12.3 GB/s from DRAM with both threads loaded, or 8.8 GB/s with a single thread. The FX-8150's L3 therefore beats DRAM bandwidth by less than a factor of two for a single thread, and an embarrassing 25% when a module is running two threads. Sandy Bridge's L3 is on a different planet.
L3 bandwidth is even more important for multithreaded applications. With all threads active, each module gets under 10 GB/s of bandwidth from L3. That's less bandwidth than it could pull from DRAM, provided no other modules are active. To be sure, having a L3 is still better than nothing, because 35-39 GB/s from L3 still beats DRAM bandwidth. But Sandy Bridge's full speed, ring based L3 can give over 30 GB/s to a single core under an all-core load.
Overclocking the Northbridge to 2.4 GHz gives 42 GB/s of L3 bandwidth and 17.2 ns of latency. Maybe we'd see adequate L3 performance if the Northbridge could clock much higher and run close to core clocks. But attempting to run the Northbridge at even 2.6 GHz resulted in the system locking up, so it's clear the Northbridge wasn't designed to run at high clock rates.
The L3 on Bulldozer is a disappointment. It inherits most of K10's L3 architecture, but regresses performance in return for a minor capacity increase. That regression is unacceptable because K10's L3 performance was already marginal, and Bulldozer's increased multithreaded performance means more potential demand for L3 bandwidth. And worst of all, Sandy Bridge is sitting there with an excellent L3 implementation. Bulldozer's L3 is better than nothing at all, but it's completely unsuited to its purpose. The L3's poor performance then dictates a large L2, and it's hard to make a L2 that's both large and fast.
Core to Core Coherency Latency
We haven't covered core to core latency much, because cross-core communication is rather rare, and usually not worth spending article space on. But how much something matters depends on both how often it happens, and how expensive it is when it does happen. Bulldozer really pushes the second part of that equation.
Contested lock cmpxchg latency is fine between two threads in a module, but going between modules costs over 200 ns. On average, you'd see about 220 ns of latency for going between threads, assuming the OS isn't smart enough to schedule threads onto the same module when they communicate with each other a lot. That's a large regression compared to K10, and also compares badly against Intel's Sandy Bridge.
Per instruction, core to core latency probably costs around 10-15% of what DRAM latency costs, if we compute using figures from the i5-6600K. The i5-6600K's caching capacity is much lower than Bulldozer's, so Bulldozer will probably suffer a bit less from L3 misses, bringing that number up a bit. If I had to guess, core to core latency probably docks gaming performance by a percent or so. It's not a huge issue, but is perhaps enough to be slightly more than a rounding error in specific cases.
Memory Controller Performance
In DRAM sized regions, Bulldozer enjoys a traditional AMD advantage with a very low latency memory controller. Compared to Phenom, Bulldozer gains official support for faster DDR3-1866. The test system was set up with relatively fast DDR3-1866 CL10, and we see just 65 ns of memory latency. The Phenom X6 1100T does a bit better at 57.8 ns using DDR3-1600 CL9, but Phenom doesn't officially support such fast memory anyway.
In terms of bandwidth, Bulldozer gets 23.95 GB/s, beating the Phenom X6 1100T's 20.55 GB/s. When equipped with fast DDR3-1866, Bulldozer should also outperform Sandy Bridge chips, which only support DDR3-1333.
We're not going to read too far into memory performance because we can't match memory setups, but Bulldozer has a pretty good memory controller.
Performance with 4 KB Pages
2 MB pages are good for showing raw cache latency, but most client applications use 4 KB pages. On paper, Bulldozer's larger 1024 entry L2 TLB should help with effective cache access latencies, compared to K10's smaller 512 entry one. But Bulldozer's L2 TLB is extremely slow. We see over 15 cycles of added latency at the 1 MB test size. If we try to isolate TLB latency by looking at latency differences when accessing one element per 4 KB page, we get a 20 cycle delta after going past L1 TLB capacity.
One culprit here is that the L2 TLB is located outside the load/store unit, in what AMD calls the "cache unit" (CU). The CU also contains the L2 cache datapath, placing it relatively far away from the individual cores. Putting the L2 TLB there lets AMD easily share it across both threads, which makes sense because the L2 TLB is a large structure that would be too expensive to duplicate in each thread's load/store unit. But taking 20 cycles to hit the L2 TLB is not a good tradeoff.
For comparison, hitting the L2 TLB on Sandy Bridge appears to add around 7 cycles, while K10 impressively only suffers a 2-3 cycle penalty. Bulldozer's smaller L1 TLB doesn't help either, because it means fewer lucky hits as memory footprints reach deep into L2 TLB coverage.
As a result, Bulldozer's L2 advantage is hampered by virtual to physical address translation latency. At test sizes between 512 KB and 2 MB, Bulldozer's latency ends up quite close to Sandy Bridge's or K10's, even though Bulldozer is fetching data from L2 while the other architectures are doing so from L3. Even when K10 misses its L2 TLB, it still enjoys a massive L3 cache latency advantage over Bulldozer. I suspect that's because K10 employs a 24 entry page walk cache, which caches entries from the next paging structure up (the page directory in this case). That page walk cache would cover 96 MB, and enable relatively fast page walks over that area. Bulldozer instead uses the L2 TLB as a page walk cache. I have no clue how the architecture determines whether to fill a direct translation or a higher level one into the L2 TLB. But clearly it's not working very well, because taking 105 cycles to grab data from L3 is dire.
Bulldozer's Struggles (Against Sandy Bridge)
Bulldozer struggled to compete against Sandy Bridge, especially in single threaded performance. There's no single cause for this. Rather, Sandy Bridge's huge lead comes down to a combination of AMD's struggles and Intel pulling off some incredible advances with Sandy Bridge.
AMD originally wanted to hold the line on Phenom's IPC while using clock speed increases to deliver a single threaded performance boost. If AMD had managed to use K10-sized integer and load/store units, I could see that happening. Bulldozer's shared frontend and FPU can provide full throughput to a single thread. Both shared units are far more powerful than the ones in K10. Combine that with any improvements at all to the per-thread integer and load/store units, and I could actually see an IPC increase over Phenom. But AMD was aiming too high. In a single architecture, they were trying to implement a grab bag of advanced architectural techniques, move to a new process node, and multithread parts of the core.
AMD's 32nm Issues
Unsurprisingly, AMD was forced to cut back. It's impossible to talk about the cutbacks without talking about the process node and clock speed goals. The company's publications drop hints of how hard it was to achieve their goals while moving to the 32 nm process.
The change from a 6T cell in 45nm to 8T in 32nm was required to improve low-voltage margin and read timing and to reduce power. Use of the 8T cell also eliminated a difficult D-cache read-modify-write timing path.
Design Solutions for the Bulldozer 32nm SOI 2-Core Processor Module in an 8-Core CPU, ISSCC 2011
Reading between the lines, AMD is saying they switched to less area efficient 8T SRAM because they could not otherwise maintain clock speeds. But even that wasn't enough. They also had to reduce bitline loading by using 8 cells per bitline instead of 16, as in prior generations. From talking to a former AMD engineer, we know the 64 KB L1D was cut to 32 KB, then to 16 KB. The increased area required by the L1D design tradeoffs may also have contributed to making the L1D write-through. AMD needed to save all the area they could, and using parity instead of ECC requires less storage.
In the integer unit, AMD duplicated the register file to increase clock speeds. As we saw from die plots earlier, the replicated register file takes a lot of area – area that could have been used to implement more execution units or more reordering capacity.
To remove critical-path wire delay, the physical register file arrays and address generator (AGEN) incrementor are replicated.
Design Solutions for the Bulldozer 32nm SOI 2-Core Processor Module in an 8-Core CPU, ISSCC 2011
It's easy to blame aggressive clock targets, but I suspect the process node is more at fault here. AMD actually ported K10 to 32 nm for Llano, and suffered similar difficulties even though Llano didn't target higher clocks. In fact, Llano wound up suffering lower clocks than its Phenom predecessors. Llano's story is one for another day, but AMD again points to 32 nm node challenges in its development:
The conversion to the 32nm high-k metal-gate (HKMG) process presents several challenges not encountered in previous generations. The PMOS drive strength increased significantly relative to NMOS. This forced revision of the keeper design for dynamic nodes and caused unacceptable delay increase in some cases.
An x86-64 Core Implemented in 32nm SOI CMOS, ISSCC 2010
In other words, even porting a 45nm design to 32nm with minimal architectural changes was not easy, unless you wanted to let clock speeds drop through the floor and eat a massive overall performance cut. Even with Llano, AMD spent a lot of effort working around 32nm issues before they could ship a product with acceptable performance.
Electromigration (EM) remediation is more difficult because the current-carrying limits are lower than the scaled values projected by geometric scaling. Careful attention is paid to signals with high activity on minimum-width wires. Techniques to reduce capacitance of such nets and specialized straps are used extensively to divert the current across multiple parallel paths.
An x86-64 Core Implemented in 32nm SOI CMOS, ISSCC 2010
I don't have an electrical engineering background so I won't read too much into the process node challenges, but from reading AMD's publications, it's clear that the 32 nm node presented significant difficulties for the Bulldozer effort.
Not Catching Up Enough
While Bulldozer is significantly more advanced than its K10 predecessor, it still trails Sandy Bridge in some areas. Its branch predictor improves over K10's, but still can't match Sandy Bridge's prediction speed or accuracy. Store forwarding is improved, but Sandy Bridge is able to handle far more forwarding cases without penalty. AMD introduced branch fusion for CMP and TEST instructions, but Intel had been iterating on branch fusion since Core 2, and Sandy Bridge can fuse most ALU instructions with an adjacent branch.
In terms of clock speeds, Bulldozer does clock higher than K10. The FX-8150 could boost up to 4.2 GHz, while Llano's top SKU only reached 3 GHz on the same 32 nm node. But Sandy Bridge also clocked quite well, and Bulldozer's clock speed advantage wound up being rather small.
In other areas, Bulldozer didn't improve over K10, but Sandy Bridge made huge leaps. For example, AMD kept the same Northbridge and L3 architecture, while Sandy Bridge brought in a new ring interconnect. Poor L3 performance forced a larger L2. But enlarging the L2 also gave it worse latency characteristics, which punished Bulldozer's small, write-through L1D.
Not Big Enough (Per-Thread)
Regardless of whether you want to see the FX-8150 as an eight core or quad core CPU, it performs more like it has eight small cores than four big ones with SMT. Each Bulldozer thread has full access to a module's frontend bandwidth and FPU resources, but only half of the module's reordering capacity. Aside from the FPU scheduler and register file, all of the key out-of-order execution buffers available to a single Bulldozer thread are smaller than their equivalents in Sandy Bridge.
That by itself isn't necessarily a huge issue. Zen 4 competes well with Golden Cove despite having smaller OoO buffers, but it does so with a lower latency cache subsystem. Bulldozer's cache subsystem is higher latency than Sandy Bridge's, putting Bulldozer in a poor position. If we combine the resources available to both threads in a module, Bulldozer starts looking quite respectable. For example, a Bulldozer module running two threads would have 2×128 ROB entries available, while a Sandy Bridge core would partition its 192 entry ROB into a 2×96 configuration. The same applies to other structures. Bulldozer can then bring a significant reordering capacity advantage to bear, mitigating its cache latency disadvantage. Memory footprints also tend to increase with more threads in play, and Bulldozer's large caches come in handy there. In some cases, Bulldozer can be very competitive in well threaded applications.
Competitive multithreaded performance is nice, but a lot of desktop applications are not embarrassingly parallel, and that's a huge problem for Bulldozer. Four strong cores will be far easier to utilize than eight weak ones. If those four cores can achieve higher throughput via SMT when given eight threads, that's great. But performance with fewer threads can't be compromised in client markets.
New Architectures Are Hard
From the start, AMD knew power efficiency went hand in hand with targeting high multithreaded performance, so Bulldozer has power saving techniques sprinkled in all over. They switched to a PRF-based scheme, optimized the scheduler, and used way-prediction to access the 4-way L1D. Clock gating helped reduce power when various functional units like the FPU were idle. AMD paid a lot of attention to reducing critical path lengths in the layout, and even duplicated the integer register file to place the ALUs and AGUs closer.
On top of that, AMD implemented advanced power management techniques. Like Sandy Bridge, Bulldozer actively monitors power to allow opportunistic boosting while staying within TDP limits. Neither CPU directly measures power consumption. Instead, various events around the CPU are monitored and fed into a model that estimates power consumption.
But we know that in the end, Bulldozer was less power efficient than Sandy Bridge. One reason is that AMD struggled to optimize power in a design that differed radically from their previous ones.
Usually, CPU energy effectivity is achieved by means of cautious evaluation of the ability consumption of the finished implementation to determine waste and discount alternatives, that are then applied in a subsequent model of the core. With a ground-up design resembling Bulldozer, this type of evaluation loop was not attainable: energy effectivity needed to be designed concurrently with convergence on timing and performance
Bulldozer: An Method to Multithreaded Compute Efficiency, IEEE 2011
I think that is like software program optimization, the place you first determine the largest contributors to runtime earlier than concentrating on your optimization efforts there. When you don’t do your profiling appropriately, you could possibly spend a large period of time optimizing a piece of code that solely accounts for just a few p.c of runtime. Equally, when you can’t determine the largest sources of energy waste, you would possibly spend lots of time optimizing one thing that doesn’t waste lots of energy within the first place. In fact, the precise image is probably going extra complicated. AMD’s engineers used “RTL-based clock and flip-flop exercise evaluation as a proxy for switching energy”. Transistors which might be switching on and off are inclined to devour extra energy, so these simulations gave the engineers an thought of the place to optimize at an early stage. Because the design progressed, AMD acquired a greater thought of what actual energy consumption can be, however massive adjustments are tougher to make because the design nears completion.
High clock targets made power optimization even harder, especially as getting things to clock up on 32 nm appeared to be very difficult. AMD had to tweak transistors to make the right tradeoffs between leakage (power waste) and delay (ability to reach higher clocks).
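A common form of this tradeoff is multi-Vt cell assignment: fast but leaky low-threshold transistors go only on timing-critical paths, while everything else gets slow, low-leakage high-threshold cells. A minimal sketch of that decision rule, with invented delay and leakage numbers:

```python
# Sketch of multi-Vt assignment: spend leaky, fast low-Vt cells only where
# timing demands it, and use frugal high-Vt cells everywhere else.
# Delay and leakage figures are invented for illustration.

CELLS = {
    "low_vt":  {"delay_ps": 10, "leakage_nw": 50},  # fast but leaky
    "high_vt": {"delay_ps": 14, "leakage_nw": 5},   # slow but frugal
}

def assign_vt(path_slack_ps: float) -> str:
    """Pick the high-Vt cell whenever the path has enough timing slack
    to absorb the extra delay; otherwise fall back to low-Vt."""
    extra_delay = CELLS["high_vt"]["delay_ps"] - CELLS["low_vt"]["delay_ps"]
    return "high_vt" if path_slack_ps >= extra_delay else "low_vt"

print(assign_vt(10.0))  # non-critical path with slack to spare
print(assign_vt(1.0))   # critical path: must stay fast
```

At Bulldozer's clock targets, more paths end up critical, forcing more leaky cells and making the power problem worse, which is the vicious circle the paragraph above describes.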
Bulldozer’s dramatic single-threaded performance loss to Sandy Bridge therefore can’t be explained by a single factor, or even a small set of factors. Rather, Bulldozer has a long list of weaknesses compared to Sandy Bridge. Execution and reordering resources are lower. Penalties are higher and easier to run into. Power consumption was higher. Each individual Bulldozer problem might not be huge on its own, but they stack on top of each other to put AMD way behind Intel. For example, a combination of lower reordering capacity and higher cache latency means the core will have a harder time extracting parallelism around L1 misses. But Bulldozer’s issues weren’t purely due to difficulties faced by AMD. Intel also deserves a lot of credit for the massive progress they made with Sandy Bridge. We’ll cover Sandy Bridge in another article, but Intel was able to successfully integrate a batch of techniques they previously trialed in Netburst and other architectures.
Final Words
Bulldozer uses a huge grab bag of new techniques and technologies. In a single new architecture, AMD implemented a far more complex branch predictor, multithreaded the frontend and floating point unit, overhauled the out of order execution engine, added AVX support with FMA, and dropped in a pile of other small improvements for good measure. At the same time, they migrated to a new 32 nm process node that presented significant challenges all by itself.
AMD struggled to pull all of these changes off while optimizing for power consumption, creating some similarities to Netburst. But AMD got Netburst’s benefits too, because Bulldozer looks to be a bit of a proving ground for a pile of new techniques that were later used in Zen. In the same way that Netburst arguably fueled Sandy Bridge’s later success, Bulldozer arguably contributed to Zen’s success.
Area | K10 | Bulldozer | Zen | Comments |
Branch Prediction | Single level BTB, 2-level global history predictor | Two level BTB, meta predictor combining local and global history | Triple level BTB, perceptron predictor | Bulldozer’s branch predictor is like a stepping stone on the way to Zen’s |
Frontend | Fetch coupled to branch predict, 3-wide decode | Decoupled fetch/branch predict, 4-wide decode | Decoupled fetch/branch predict, long range instruction prefetch, 4-wide decode | Zen builds on top of Bulldozer’s foundation |
Rename | No move elimination | Vector register MOVs eliminated | GPR and vector MOVs eliminated | Again, we see Zen building on Bulldozer |
OoO Execution | Hybrid ROB/RRF for integer, PRF for floating point | PRF scheme | PRF scheme | |
FPU | 3-port FPU with FADD, FMUL, and FStore pipes | 4-port FPU with FMA | 4-port FPU with FMA, but rearranged port layout | Zen 2 actually uses a split upper/lower FP RF layout to place execution units closer to the register file, a bit like Bulldozer |
L1D | 64 KB 2-way, write-back | Way-predicted, 16 KB 4-way, write-through to 4 KB WCC | Way-predicted, 32 KB 8-way, write-back | Way prediction was carried forward to Zen, where it did quite well |
Multithreading | None | CMT. Shared FPU, frontend, and L2 | SMT. Most major core units shared | Subsequent 15h generations helped AMD tune how various core components handled multithreading |
Power monitoring | Only Llano modeled power consumption using events | Power consumption modeled by monitoring various core events | Power consumption modeled by monitoring various core events | AMD gained experience with power management and boosting with Bulldozer |
Zen employed a lot of the technologies that debuted in Bulldozer, but far more successfully. After several generations of 15h struggle, AMD had mastered these new techniques and was able to use them effectively in Zen. Zen delivered a combination of decent power efficiency and single threaded performance, while beating Intel in multithreaded performance.
Still, a Netburst comparison isn’t completely appropriate, because Bulldozer avoids a lot of Netburst’s worst flaws. While Bulldozer generally has higher penalties than Sandy Bridge, they’re nowhere near as extreme or as widespread as what we saw on Netburst. In some corner cases, like loads crossing a 4K page boundary, Bulldozer completely avoids penalties that pop up on Sandy Bridge. Bulldozer’s performance is therefore not far off what you’d expect based on its reordering capacity and cache setup. In contrast, Netburst brought massive reordering capacity and execution resources, but couldn’t match K8 in performance per clock.
Bulldozer was undoubtedly a painful experience, but it was a necessary one for AMD. Like Intel’s P6 architecture, AMD’s Athlon had served the company for more than a decade but was clearly becoming dated and difficult to evolve. Llano’s clock speed struggles on 32 nm are a good showcase of this. At the same time, AMD probably understood that beating Intel’s single threaded performance would be out of reach, and decided to compete on multithreaded performance. Bulldozer ultimately aimed too high. But AMD’s engineers should be commended for accomplishing as much as they did in a single generation, even if the sum of the parts didn’t produce an impressive whole.
If you like our articles and journalism and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way, or if you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.
Footnotes
Test Setup
| Bulldozer | K10, Quad Core | K10, 6-Core |
CPU | FX-8150 (Revision OR-B2, Family 15h, Model 01h), 4.2 GHz Max Boost | Phenom II X4 945 | Phenom II X6 1100T |
Motherboard | Gigabyte GA-990FX-Gaming | Asus M3A78-CM | Asus Crosshair V Formula-Z |
Northbridge Clock | 2.2 GHz | 2 GHz | 2.8 GHz |
Memory | Dual Channel DDR3-1866 10-11-10-30, Kingston KHX1866C10D3 | Dual Channel DDR2-800 6-6-6-18, Micron 16HTF25664AY-800J1 | Dual Channel DDR3-1600 9-9-9-27, Kingston HX316C9SRK2 |
GPU | XFX AMD Radeon HD 5850 | EVGA Nvidia GTX 730 | Sapphire AMD Radeon HD 6970 |
References
- 1. Design of the Two-Core x86-64 AMD “Bulldozer” Module in 32 nm SOI CMOS, IEEE Journal of Solid-State Circuits, Vol. 47, No. 1, January 2012
- 2. Design Solutions for the Bulldozer 32nm SOI 2-Core Processor Module in an 8-Core CPU, ISSCC 2011
- 3. Westmere: A Family of 32nm IA Processors, ISSCC 2010