Frontend and Execution Engine – Chips and Cheese
AMD’s K7 Athlon architecture formed the basis of the company’s CPU offerings for around a decade. Athlon did very well against Intel’s P6 based Pentium III. K8 introduced 64-bit support and an integrated memory controller, and remained reasonably competitive against Netburst. But after 2006, Intel’s Core 2 got the basics right while bringing a larger and more advanced core. In response, AMD evolved their Athlon architecture yet again to create the 10h family, codenamed “Greyhound” (K10), known as Phenom on the desktop market. We’ll cover that another day. But as 2010 rolled around, Intel’s Nehalem overhauled their core to core interconnect and brought in their own memory controller. AMD found themselves with an old architecture that was getting harder and harder to evolve. At the same time, Intel was steadily eroding all of AMD’s traditional advantages. AMD had to do something big, or the company would have no chance of catching Intel.
That’s where Bulldozer comes in. Instead of trying to push the basic Athlon architecture further, AMD went for a completely different, thoroughly modern design. Bulldozer was designed to be wider, deeper, and more flexible than an evolved 10h core could reasonably be. In some areas, it was even more advanced than its main Intel competition, Sandy Bridge.
Today, we know how that went. Every tech enthusiast probably associates a lot of things with the name “AMD FX” or “Bulldozer”. Some of them might have had positive experiences when it came to overclocking, but that’s about it. Most people know FX as the CPUs from AMD that ran slow and hot and were power hungry. Some might even know that its failure nearly bankrupted AMD because of its uncompetitive performance. It was slower than its own predecessor, Greyhound, in single core performance and only roughly matched it in multi core performance (Phenom II X6 1100T vs FX-8150) despite having two more “cores”. Furthermore, the chip’s power draw under maximum load was also higher. Against Intel’s Sandy Bridge, the end result becomes even more questionable.
But as you might guess, nobody at AMD envisioned it that way in the planning or design phases. No engineer would ever start working with the idea to “build a shit product”. A recent chat with an engineer who was at AMD during Bulldozer’s development gave us additional insight into what the original goals for the architecture were. AMD originally wanted Bulldozer to be like K10, but with a shared frontend and FPU. In one architecture, AMD would improve single threaded performance while massively increasing multithreaded performance, and move to a new 32 nm node at the same time. But these goals were too ambitious, and AMD struggled to keep clock frequency up on the 32 nm process. This resulted in cuts to the architecture, which started to stack up.
We will be using a lot of code names throughout the article, so here is a little overview to avoid getting lost.
Core architecture | Chip codename | Product codename | Product name |
Greyhound / 10h / K10 | Ridgeback, Pharao | Deneb, Thuban | Phenom, Athlon |
Bulldozer / 15h | Orochi | Zambezi | FX |
Husky / 12h (10h evolution on 32 nm) | Llano | Llano | A-3000 series |
Block Diagram
Bulldozer was a new, unconventional design, with big differences compared to previous AMD architectures. Each module was designed to run two threads. The frontend, FPU, and L2 cache are shared by two threads, while the integer core and load/store unit are private to each thread. That improves area efficiency because the frontend and FPU take a lot of area. It also makes sense because the frontend is more than sufficient to feed a single thread, and the FPU is only heavily used in certain applications.
From a physical perspective, a Bulldozer module does vaguely resemble K10 with a shared frontend and FPU. Like K10 and previous Athlon-derived architectures, the core is laid out with fetch, decode, integer execute, load/store, and FPU blocks running across the width of the core, in that order.
Each Bulldozer module occupies 30.9 mm², including the L2 cache, and contains 213 million transistors. On the same process node, a Llano core takes 9.69 mm² excluding the L2 cache.
Frontend: Branch Prediction
The branch predictor sits at the very start of the CPU’s pipeline, and is responsible for quickly and accurately telling it where to fetch instructions from. AMD had a solid branch predictor when Athlon debuted in 1999, but Intel had expended considerable effort on developing more advanced branch predictors. This wasn’t a big problem for AMD in the early 2000s, because the potential in Netburst’s advanced branch predictor was hidden behind that architecture’s flaws. But by the late 2000s, K8’s branch predictor was no match for Core 2’s, and the P6 based Core 2 didn’t have a pile of show stopping penalties. AMD’s problem only grew as time went on, so Bulldozer got a big branch predictor overhaul.
Feature | Bulldozer | K10 |
Direction Predictor | Hybrid predictor. a) Two-level predictor using 12 bits of global history and 2 bits of branch address = 16384 entry history table (?) b) Predictor using local history. Selects between (a) and (b) with a meta predictor | Two-level predictor using global history, with 16384 entry history table. Probably indexed using 12 bits of global history and 2 bits of branch address |
Return Predictor | 24 entry return stack per thread | 24 entry return stack |
Branch Location Tracking | Decoupled from instruction cache. Branch predictor can run ahead of fetch and populate a queue of fetch targets | Tied to instruction cache. Predecode data includes branch selectors. Restrictions on how many branches can be tracked per 16B aligned block |
Branch Target Caching | Two level branch target buffer (BTB). Level 1: 512 entry, 2 cycle latency. Level 2: 5120 entry, 5 cycle latency | Single level branch target buffer with 2048 entries, 2 cycle latency |
Indirect Branch Prediction | 512 entry indirect target array | 512 entry indirect target array |
In almost every area, Bulldozer is a big improvement over K10. Unfortunately for AMD, Intel managed to keep iterating at an impressive pace, and the branch predictor is no exception. Sandy Bridge has an excellent branch predictor, and can recognize longer patterns than Bulldozer.
Sandy Bridge is also better at handling large branch footprints, and can still handle moderate history lengths with 512 branches. Bulldozer really starts to struggle at that point. Indexing into the history table with a 2-bit hash of the branch address probably leads to a lot of aliasing with many branches in play.
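To make the aliasing argument concrete, here is a minimal sketch of how such a history table index could be formed. It assumes the index is 12 bits of global history concatenated with a 2-bit hash of the branch address, as tentatively listed in the table above. The XOR-fold hash and the branch addresses are hypothetical, since AMD has not documented the real function.

```python
# Assumption: index = 12 bits of global history + 2-bit hash of the branch address.
HISTORY_BITS = 12
ADDRESS_BITS = 2
TABLE_ENTRIES = 1 << (HISTORY_BITS + ADDRESS_BITS)  # 2**14 entries


def address_hash(branch_address: int) -> int:
    """Hypothetical 2-bit hash: XOR-fold the address two bits at a time."""
    folded = 0
    while branch_address:
        folded ^= branch_address & ((1 << ADDRESS_BITS) - 1)
        branch_address >>= ADDRESS_BITS
    return folded


def table_index(global_history: int, branch_address: int) -> int:
    """Combine the most recent history bits with the address hash."""
    history = global_history & ((1 << HISTORY_BITS) - 1)
    return (history << ADDRESS_BITS) | address_hash(branch_address)


def aliasing_pairs(branch_addresses, global_history: int) -> int:
    """Count pairs of distinct branches that share one history table entry."""
    buckets = {}
    for address in branch_addresses:
        buckets.setdefault(table_index(global_history, address), []).append(address)
    return sum(len(b) * (len(b) - 1) // 2 for b in buckets.values())


# 512 made-up branch addresses that all see the same global history.
branches = [0x400000 + 16 * i for i in range(512)]
same_history = 0b101010101010
distinct_entries = {table_index(same_history, a) for a in branches}
collisions = aliasing_pairs(branches, same_history)
```

With only two address bits in the index, branches that share the same recent history can land on at most four different entries, so a 512 branch footprint has to fight over them no matter which hash is used.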
Branch Target Tracking
Branch predictors need to be fast as well, to avoid stalling the pipeline too much while waiting for the next branch target. Bulldozer focuses on increasing branch tracking capacity compared to K10, rather than speed. However, AMD did decouple the branch predictor from the instruction fetch stage, giving the frontend a bit more queueing capacity to prevent branch predictor delays from starving the pipeline. Instead of getting predictions at the L1i fetch stage, the predictor runs ahead and populates a queue of fetch targets for each thread.
In theory, that should allow Bulldozer to retain high instruction fetch bandwidth even in the face of instruction cache misses, as long as the BTB is large enough to cover an application’s branch footprint, and the direction predictor is accurate enough. However, our testing shows that’s not the case. We see taken branch latency increase as the test loop spills out of the 64 KB L1i, indicating that the prediction queue or the L1i miss queue doesn’t have enough entries to hide L2 latency. AMD states that on an L1i miss, Bulldozer does next-line prefetch. If that’s true, and Bulldozer doesn’t use the prediction queue to continuously queue up L1i miss requests, that would explain the poor behavior we see when fetching code from L2.
Decoupling the branch predictor from L1i also removes the branch tracking restrictions that plagued K7, K8, and K10. Those architectures could fail to predict a branch because the predictor couldn’t track it. Compilers worked around this to some extent by padding or using longer encodings. Of course, that’s a messy solution because it sacrifices code density. Bulldozer removes that limitation.
However, AMD did little to improve the branch predictor’s speed. The new two-level BTB arrangement provides more branch target tracking capacity, but the second level BTB is very slow. Taking a branch target from it costs 5 cycles. The first level BTB is smaller than the one in K10, but no faster. Bulldozer is thus unable to handle taken branches back to back, meaning that loop unrolling is still important for AMD.
Sandy Bridge in contrast can handle up to eight taken branches without stalling the frontend at all. Intel also has a larger 4096 entry L1 BTB that’s as fast as the 512 entry L1 BTB on Bulldozer. So even though both architectures have a 4-wide frontend on paper, Bulldozer is going to lose far more frontend throughput around taken branches.
Bulldozer and Sandy Bridge can both hide taken branch latencies to some extent when running two threads in a module or core. With another independent chain of branch targets, the L1 BTBs on both CPUs can provide a taken branch target every cycle. However, Sandy Bridge still has a significant advantage if Bulldozer has to hit the L2 BTB, which can only provide a taken branch target once every three cycles.
For return prediction, Bulldozer has a 24 entry return stack. Contemporary Intel architectures have 16 entry return stacks, so Bulldozer’s is quite large. The return stack is duplicated to handle two threads, just like in Sandy Bridge.
Frontend: Fetch and Decode
Not everything about K10 needed an overhaul. Like K10, Bulldozer uses a large 64 KB, 2-way instruction cache. Bulldozer’s L1i is physically implemented as an 8×2 array of 4 KB bank macros, using 8T SRAM. Predecode information is stored alongside the instruction cache in two separate arrays, suggesting that 8 KB of storage is used to hold predecode data. The instruction cache can deliver 32 bytes per cycle, although a single thread can’t make full use of that bandwidth. With 8 byte NOPs and one thread, we average 22-23 bytes per cycle of instruction bandwidth. Running two threads in a module brings that up to almost 32 bytes per cycle, in line with AMD’s documentation.
Sandy Bridge takes a different, high tech approach to instruction delivery. A 1536 entry micro-op cache holds decoded instructions, and can effectively provide 32 instruction bytes per cycle. This micro-op cache has its roots in Netburst’s trace cache, but doesn’t try to function as an L1i. Instead, Sandy Bridge backs the op cache with a conventional 32 KB instruction cache capable of delivering 16 bytes per cycle. Unlike Bulldozer, Sandy Bridge doesn’t need a second thread to hit full fetch bandwidth. But the micro-op cache is relatively small, and Bulldozer will have a fetch bandwidth advantage if both CPUs have to pull instructions from their L1i caches.
Both CPUs suffer when running code out of L2 or L3, but Bulldozer is worse. Like Phenom, Bulldozer can’t exceed 4 bytes per cycle on average from L2. Running two threads in the module doesn’t help L2 bandwidth, but slightly helps in L3 sized regions. Sandy Bridge isn’t great either, but does achieve better code fetch bandwidth out of its L2 and L3 caches. The two architectures therefore trade blows depending on code footprint sizes. Bulldozer is hurt more by L1i misses, but has a larger L1i that should suffer from fewer misses. Sandy Bridge is better at either very small or very large code footprints.
With shorter 4 byte NOPs, which are more representative of typical instruction lengths found in scalar integer code, fetch bandwidth concerns pretty much go away. Bulldozer’s frontend can deliver 4 instructions per cycle from the instruction cache, regardless of whether both threads in the module are active. Sandy Bridge can too, as long as it doesn’t suffer instruction cache misses.
Past L1i, Bulldozer is stuck at about 1 IPC. Absolute instruction fetch bandwidth is mostly unchanged from the 8 byte NOP test. Again, we see Bulldozer’s advantage with a large instruction cache. Sandy Bridge does better with very large code footprints, especially if an application can use enough threads to take advantage of SMT.
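The relationship between fetch bandwidth and IPC in these NOP tests is simple arithmetic, sketched below. The 22.5 bytes per cycle figure is just the midpoint of the 22-23 bytes per cycle we measured with one thread; the function and variable names are ours.

```python
FRONTEND_WIDTH = 4  # instructions per cycle the frontend can deliver


def instructions_per_cycle(bytes_per_cycle: float, instruction_bytes: int) -> float:
    """IPC that a given fetch bandwidth can sustain, capped by frontend width."""
    return min(FRONTEND_WIDTH, bytes_per_cycle / instruction_bytes)


def bytes_needed(instruction_bytes: int) -> int:
    """Fetch bandwidth required to keep a 4-wide frontend fully fed."""
    return FRONTEND_WIDTH * instruction_bytes


# One thread running from L1i at roughly 22.5 bytes per cycle.
ipc_8_byte_nops = instructions_per_cycle(22.5, 8)  # fetch bound, below 4
ipc_4_byte_nops = instructions_per_cycle(22.5, 4)  # frontend width bound

# Past L1i, about 1 IPC with 4 byte NOPs corresponds to about 4 bytes per cycle.
bytes_per_cycle_past_l1i = 1.0 * 4
```

A 4-wide frontend needs 32 bytes per cycle to stay fed with 8 byte NOPs but only 16 with 4 byte NOPs, which is why the single thread fetch limit stops mattering for typical instruction lengths.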
Rename/Allocate
After the frontend has brought instructions into the core, the renamer is responsible for allocating resources in the backend to track them for out of order execution. The rename stage is also a convenient place to pull some tricks that expose more instruction level parallelism to the backend. For example, certain operations like subtracting a value from itself or XOR-ing a value with itself will always result in zero, and are commonly used to zero a register. There’s no need to wait for the previous value of the register to be ready before executing such an operation, and the renamer can make that clear to the backend. Bulldozer’s renamer can break dependencies in this case to expose more parallelism to the backend, but can’t eliminate zeroing idioms the way Sandy Bridge can.
Case | Bulldozer | Sandy Bridge |
Zeroing Idioms (XOR r,r; SUB r,r; XORPS xmm, xmm) | Dependency broken. ALU pipe used. Physical register allocated | Dependency broken. No ALU pipe used. No physical register allocated |
Scalar integer register to register copy | Not optimized | Not optimized |
Vector register to register copy | Dependency broken. No execution pipe used. No physical register allocated | Not optimized |
Move elimination is another trick. Because Bulldozer’s ROB holds pointers to physical registers, copying values between registers can be as simple as having the renamer point two architectural registers to the same physical one. Bulldozer can do this for vector registers, letting it “execute” register to register copies within the renamer.
x86 CPUs have had to do some form of move elimination on the floating point side for decades, in order to break dependencies when dealing with the x87 register stack. AMD has long used a separate renamer to do this within the floating point unit, and extending that to eliminate MOVs with SSE registers probably wasn’t much of a stretch. Eliminated moves don’t consume an execution pipe and don’t require a physical register to be allocated. There’s unfortunately no move elimination on the scalar integer side, but for perspective, Sandy Bridge has no move elimination at all.
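As a rough illustration of the idea, here is a toy renamer that eliminates register to register copies by aliasing two architectural registers to one physical register. This is a sketch under simplifying assumptions, not AMD’s implementation: the reference counting scheme and all names are hypothetical, and a real core would only free the old physical register once the overwriting instruction retires.

```python
class Renamer:
    """Toy register alias table (RAT) with reference-counted physical registers."""

    def __init__(self, arch_regs, num_physical):
        self.free = list(range(num_physical))
        self.refcount = [0] * num_physical
        self.rat = {}
        for reg in arch_regs:
            self.rat[reg] = self._allocate()

    def _allocate(self):
        phys = self.free.pop(0)
        self.refcount[phys] = 1
        return phys

    def _release(self, phys):
        # Simplification: freed immediately instead of at retirement.
        self.refcount[phys] -= 1
        if self.refcount[phys] == 0:
            self.free.append(phys)

    def write(self, dest):
        """Ordinary instruction: the destination gets a fresh physical register."""
        old = self.rat[dest]
        self.rat[dest] = self._allocate()
        self._release(old)
        return self.rat[dest]

    def move(self, dest, src):
        """Eliminated move: dest points at src's physical register.

        No new physical register is allocated and nothing is sent to an
        execution pipe.
        """
        old = self.rat[dest]
        phys = self.rat[src]
        self.refcount[phys] += 1
        self.rat[dest] = phys
        self._release(old)
        return phys


renamer = Renamer(["xmm0", "xmm1"], num_physical=8)
renamer.move("xmm1", "xmm0")  # xmm1 now aliases xmm0's physical register
renamer.write("xmm0")         # xmm0 gets a new register, xmm1 keeps the old value
```

The reference count is what keeps the shared physical register alive after `xmm0` is overwritten, since `xmm1` still points at it.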
Out of Order Execution
AMD completely overhauled the out of order execution engine in Bulldozer. Athlon and Phenom used a hybrid scheme, with a ROB+RRF setup for the integer side, and a PRF setup for the FP/vector side. Bulldozer ditches this and goes all in with a modern PRF scheme everywhere. Intel did the same thing with Sandy Bridge, ditching Nehalem’s ROB+RRF setup, which dates back to Intel’s old P6 architecture, in favor of a PRF scheme.
In the old ROB+RRF scheme, renamed registers are simply result fields in the ROB. That makes renaming simple, because every instruction allocated into the backend simply writes its result back into the ROB. The renamer doesn’t have to worry about finding free entries in a separate register file. Storing results in the ROB also means you can’t run out of registers for renaming until the ROB fills, simplifying tuning and performance analysis.
But this scheme has disadvantages too. When instructions retire, their results have to be physically copied into a separate retired register file (RRF). Register count also has to grow with ROB size, which isn’t ideal because a lot of instructions like compares, branches, and stores don’t need to write a result into a register. Renamed registers or result field slots for those instructions end up being unused. In contrast, the PRF based scheme in Bulldozer and Sandy Bridge stores results in separate physical register files. The ROB only holds pointers to the register file entries. When instructions are retired, only the pointers have to be copied. That reduces data movement, especially when dealing with large vector registers. With the PRF scheme, Intel and AMD were both able to massively increase reordering capacity.
Structure | Bulldozer | Sandy Bridge | Phenom (10h) |
Reorder Buffer | 128 entry, replicated per thread | 168 entry | 72 entry |
Integer Registers | 96 entry (~68 speculative), replicated per thread | 160 entry | Tied to ROB |
FP/Vector Registers | 160 entry (~107 speculative for XMM, ~51 for YMM), competitively shared | 144 entry | 120 entry |
Load Queue | 40 entry, replicated per thread | 64 entry | 32 entry (shared with stores) |
Store Queue | 24 entry, replicated per thread | 36 entry | 32 entry (shared with loads) |
Branch Order Buffer | 42 entry | 48 entry | N/A |
Scheduler | 40 Integer + 60 FP | 54 entry | 3×8 Integer + 42 FP |
AMD’s case is especially significant, because the ROB size increases by a whopping 77% over K10. K10 ran into ROB capacity limits a lot, and Bulldozer aims to address that.
K10 could also suffer heavily from branch mispredicts, because the backend couldn’t accept new instructions until the mispredicted branch was retired. That’s because K10 only kept two sets of register alias tables (RAT): one for retired state, and one for the latest speculative state at the renamer. After a branch mispredict, the speculative RAT would obviously be incorrect. To recover from that situation, K10 would let the mispredicted branch retire, and then copy known-good retired state to speculative state. But up until the mispredicted branch retired, the backend wouldn’t be able to allocate resources for any instructions coming from the frontend. That’s known as a “branch abort stall”.
Bulldozer avoids that kind of stall by keeping snapshots of RAT states, in a structure called a “mapper checkpoint array”. After a mispredict, Bulldozer can restore RAT state using a checkpoint, instead of waiting for the branch to retire. The backend can then start accepting instructions from the frontend even while the mispredicted branch is still in flight. In some cases, the latency of a branch mispredict can be completely hidden behind other long latency instructions.
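The checkpointing idea can be sketched in a few lines. This is a toy model, not a description of AMD’s mapper checkpoint array: the class, the checkpoint limit, and the use of monotonically increasing branch IDs are all assumptions made for illustration.

```python
class CheckpointedRat:
    """Toy speculative RAT that snapshots its state at each predicted branch."""

    def __init__(self, mapping, max_checkpoints=4):
        self.speculative = dict(mapping)
        self.checkpoints = {}  # branch id -> RAT snapshot taken at that branch
        self.max_checkpoints = max_checkpoints  # hypothetical capacity

    def rename(self, arch_reg, phys_reg):
        """Point an architectural register at a new physical register."""
        self.speculative[arch_reg] = phys_reg

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; False means the renamer must stall."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.speculative)
        return True

    def resolve_correct(self, branch_id):
        """Branch was predicted correctly, so its snapshot is no longer needed."""
        self.checkpoints.pop(branch_id, None)

    def recover(self, branch_id):
        """Mispredict: restore the snapshot and discard it and younger ones.

        Assumes younger branches always have larger ids.
        """
        self.speculative = dict(self.checkpoints[branch_id])
        for stale in [b for b in self.checkpoints if b >= branch_id]:
            del self.checkpoints[stale]


rat = CheckpointedRat({"rax": 0, "rbx": 1})
rat.take_checkpoint(branch_id=1)
rat.rename("rax", 5)   # wrong-path rename
rat.recover(branch_id=1)  # state is usable again without waiting for retirement
```

The point of the sketch is that recovery only needs the snapshot, so renaming of correct path instructions can resume as soon as the mispredict is detected.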
Integer Execution
Other K10 weaknesses were corrected in Bulldozer too. K10 had small, distributed schedulers for its integer side, and these could often fill up. Bulldozer switches to a 40 entry, 4 port, unified scheduler. This scheduler covers 31% of reorder buffer capacity, making it similar to Sandy Bridge’s in that respect. However, Sandy Bridge uses the same scheduler to handle floating point and vector operations.
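The percentages quoted here follow directly from the structure sizes in the table above, as this short calculation shows. The dictionary and variable names are ours.

```python
# Structure sizes taken from the table above (entries).
rob_entries = {"Bulldozer": 128, "Sandy Bridge": 168, "K10": 72}
scheduler_entries = {"Bulldozer": 40, "Sandy Bridge": 54}

# ROB growth from K10 to Bulldozer, as a fraction of K10's ROB size.
rob_growth = rob_entries["Bulldozer"] / rob_entries["K10"] - 1

# Fraction of the ROB that the (integer or unified) scheduler can cover.
scheduler_coverage = {
    name: scheduler_entries[name] / rob_entries[name] for name in scheduler_entries
}

for name, fraction in scheduler_coverage.items():
    print(f"{name}: scheduler covers {fraction:.1%} of the ROB")
print(f"Bulldozer ROB growth over K10: {rob_growth:.1%}")
```

Bulldozer’s 40 entry integer scheduler and Sandy Bridge’s 54 entry unified scheduler both come out at just under a third of their respective ROB sizes.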
But implementing a large scheduler is not easy. To keep clock speeds up and power consumption down, AMD made changes to the scheduling algorithm. Usually, CPU designers try to make sure the oldest ready instruction sitting in the scheduler gets executed, similar to how a restaurant might try to prioritize the customer who has been waiting the longest.
Selecting the oldest instruction first is a known good heuristic as it is more likely that an older instruction blocks execution of later dependent operations. However, an oldest-first heuristic requires tracking the age of entries in the scheduler, which has a hardware cost.
Henry Wong, A Superscalar Out-of-Order x86 Soft Processor for FPGA
Intel’s P6 architecture used a collapsing priority queue to make sure the schedulers always sent the oldest ready instruction for execution. AMD’s Bulldozer presentation at ISSCC 2011 suggests some of AMD’s earlier architectures may have done the same. However, a shifting, collapsing structure would be too power hungry. So, Bulldozer uses an ancestry table that tracks the oldest instruction and prioritizes it for execution. This method should get some of the benefits of a true oldest-first scheme, while avoiding the costs of a collapsing priority queue. If the oldest instruction isn’t ready to execute, other instructions are chosen depending on where they physically are in the scheduler, which doesn’t correspond to age.
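A minimal sketch of that selection policy, as we understand it, is shown below. It assumes that the positional fallback simply picks the lowest numbered occupied slot that is ready; the real priority order within the scheduler is not documented, and the function and field names are hypothetical.

```python
def pick(entries, oldest_slot):
    """Choose one scheduler slot to issue this cycle.

    entries: list indexed by physical slot, each None or {"age": int, "ready": bool}.
    oldest_slot: slot holding the oldest instruction, tracked separately
                 (standing in for the ancestry table), or None.
    """
    # First priority: the single oldest instruction, if it is ready.
    if oldest_slot is not None:
        oldest = entries[oldest_slot]
        if oldest is not None and oldest["ready"]:
            return oldest_slot
    # Fallback: positional priority, which ignores instruction age.
    for slot, entry in enumerate(entries):
        if entry is not None and entry["ready"]:
            return slot
    return None


# Smaller age value = older instruction.
scheduler = [
    {"age": 3, "ready": True},   # youngest, but sits in slot 0
    {"age": 1, "ready": False},  # oldest, still waiting on an operand
    {"age": 2, "ready": True},
]
choice_while_oldest_waits = pick(scheduler, oldest_slot=1)
scheduler[1]["ready"] = True
choice_once_oldest_ready = pick(scheduler, oldest_slot=1)
```

While the oldest instruction is stalled, the sketch issues the youngest ready instruction simply because of where it sits, which is the behavior a true oldest-first scheduler would avoid.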
With these optimizations, Bulldozer implements a large unified scheduler without needing a pile of extra pipeline stages. Contrast that with Netburst, which has more than twice as many pipeline stages from rename to execute.
AMD made other changes to increase clock speed without excessive pipelining, and these are quite visible in the per-thread integer execution engine. Besides switching to a PRF scheme to minimize data movement, Bulldozer duplicates the integer register file to reduce critical path lengths. Each register file copy has four read ports and four write ports. Reads can come from either register file copy depending on the execution pipe involved, while writes go to both copies. The integer RF thus effectively has eight read ports and four write ports. Most of the scheduling structures are parity protected.
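The port arithmetic of a duplicated register file can be modeled with a small sketch. This is a behavioral toy under our own assumptions, with hypothetical names; it only shows why duplication adds read ports but not write ports.

```python
class DuplicatedRegisterFile:
    """Toy register file stored as identical copies to multiply read ports."""

    READ_PORTS_PER_COPY = 4
    WRITE_PORTS_PER_COPY = 4

    def __init__(self, num_registers, copies=2):
        self.copies = [[0] * num_registers for _ in range(copies)]

    def write(self, register, value):
        # Every write is broadcast to all copies so they stay identical,
        # which means each write occupies a write port on every copy.
        for copy in self.copies:
            copy[register] = value

    def read(self, register, copy_index):
        # An execution pipe reads only from the copy it is placed next to.
        return self.copies[copy_index][register]

    @property
    def effective_read_ports(self):
        return self.READ_PORTS_PER_COPY * len(self.copies)

    @property
    def effective_write_ports(self):
        return self.WRITE_PORTS_PER_COPY


# 96 integer registers per thread, per the table above.
integer_rf = DuplicatedRegisterFile(num_registers=96)
integer_rf.write(10, 1234)
```

Because both copies must see every write, write bandwidth stays at four per cycle while read bandwidth doubles to eight.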
Bulldozer’s integer execution units are rather light compared to both Sandy Bridge and K10. There are only two ALU pipes capable of handling common operations like adds and compares. Execution unit throughput is usually not a bottleneck, especially with K10, which had over-provisioned integer execution resources to enable its simple three-lane layout. But AMD might have gone too far in the space saving direction. Bulldozer’s frontend can deliver four instructions per cycle to a single thread, and most other 4-wide CPUs have three or more ALUs.
FP and Vector Execution
Bulldozer’s FPU can accept four operations per cycle from the frontend. The four operations that come in on a single cycle have to be from a single thread, but if two threads are active and using FP/vector instructions, the FPU can switch which thread it’s receiving operations for every cycle.
Because the FPU is designed to handle two threads, its out-of-order bookkeeping resources are overpowered for a single thread. The 60 entry unified scheduler by itself is larger than the 54 entry unified scheduler that Sandy Bridge uses to handle all instructions, and much larger than the 42 entry FPU scheduler in K10. Bulldozer’s FP register file is no joke either, with 160 entries. That’s enough to cover most of a single thread’s ROB, after excluding registers used to hold architectural state across both threads. K10 only has 120 FP registers and Sandy Bridge only has 144.
Physically, the register file is split into two 10-bank arrays placed on either side of the FPU, where they’re close to the most commonly used vector and FP execution units. One array is wider than the other to handle 80-bit x87 operations. Units that need to access lanes across an entire vector register are placed between the two register file arrays. AMD’s ISSCC 2011 presentation states that the FP register file can support 10 reads and 6 writes per cycle [2], and has a total of 13 read and 7 write buses to the execution units [1]. That should be enough bandwidth to feed the FPU’s four execution pipes.
Unlike some SMT implementations, there’s no strict partitioning or watermarking of FPU registers or scheduler capacity when a module has two threads active. In fact, if one thread is only running integer instructions, FPU operation is indistinguishable from single threaded mode.
AMD says the FPU is thread agnostic, and it does feel like the FPU is simply handed instructions from the frontend and told which integer core to report completion to. If both threads are using the FPU, its resources are competitively shared. We see a very gradual increase in the total latency of two cache misses, instead of a sharp increase when the register file or scheduler capacity is exceeded.
As we get towards maximum scheduler or RF capacity, each thread has a lower probability of getting the reordering capacity it needs to execute both long latency loads in parallel. Contrast that with Sandy Bridge, where the FP register file is strictly partitioned with both threads active.
Sandy Bridge’s scheduler appears to be watermarked, so that one thread is not allowed to use more than about 40 entries when its sibling thread is active.
Bulldozer’s FPU therefore has a pretty simple SMT implementation. Unlike Sandy Bridge, it doesn’t reconfigure itself depending on whether it’s handling one or two threads. The load balancing policy could be as simple as letting the frontend arbitrate between two threads, and throttling FPU instruction delivery for one thread if necessary to ensure fairness. This simplicity can come with corner-case advantages too, like letting one thread get unrestricted access to FPU resources if the sibling thread is only executing scalar integer instructions.
FPU Execution Units, and AVX Implementation
We recently covered the AVX-512 implementations in Zen 4 and Cannon Lake, so it’s only fitting that we cover Bulldozer’s introduction of AVX to AMD’s lineup. AVX extends the vector registers to 256 bits and adds floating point instructions that operate on 256-bit vectors. Bulldozer decodes 256-bit instructions into two 128-bit micro-ops, and tracks them throughout the pipeline. The main benefit from AVX is therefore increased code density.
In contrast, Sandy Bridge implements 256-bit physical registers and FP execution units, giving Intel far better reordering capacity and throughput for AVX code. However, Bulldozer does have a trick up its sleeve with FMA (fused multiply add) support. An FMA instruction computes a*b+c in a single pass, taking advantage of how the last step of a multiplication operation involves adding a batch of partial sums. With FMA, a Bulldozer module can match a Sandy Bridge core’s floating point throughput.
Of course, this only works if you can use the output of a multiply operation in an add operation. Two completely independent add and multiply operations won’t benefit. Another problem is that Bulldozer uses FMA4 instructions, which specify a d = a*b+c operation. Intel never supported FMA4. When Haswell introduced FMA to Intel’s lineup, it used FMA3, meaning that one of the source operands would be overwritten. Intel’s dominant market position meant FMA4 never gained widespread software support. AMD introduced FMA3 support with Piledriver, and software standardized on that.
Operation | Bulldozer | Sandy Bridge | K10 |
FP Add | 2×128-bit, 5 cycle latency | 1×256-bit, 3 cycle latency | 1×128-bit, 4 cycle latency |
FP Multiply | 2×128-bit, 5 cycle latency | 1×256-bit, 5 cycle latency | 1×128-bit, 4 cycle latency |
FP Fused Multiply Add | 2×128-bit, 6 cycle latency | Nope. Can chain multiply+add for 1×256-bit throughput, 8 cycle latency | Nope. Can chain multiply+add for 1×128-bit throughput, 8 cycle latency |
Vector Integer Add | 2×128-bit, 2 cycle latency | 2×128-bit, 1 cycle latency | 2×128-bit, 2 cycle latency |
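The claim that FMA lets a Bulldozer module match a Sandy Bridge core’s peak floating point throughput can be checked with quick arithmetic from the pipe counts in the table. The sketch assumes 32-bit single precision elements and counts an FMA as two floating point operations per lane; the function name is ours.

```python
def flops_per_cycle(pipes, vector_bits, element_bits=32, flops_per_lane=1):
    """Peak floating point operations per cycle for a set of identical pipes."""
    lanes = vector_bits // element_bits
    return pipes * lanes * flops_per_lane


# Bulldozer module: two 128-bit FMA pipes, each lane doing a multiply and an add.
bulldozer_with_fma = flops_per_cycle(pipes=2, vector_bits=128, flops_per_lane=2)

# Bulldozer without FMA: the same two pipes doing plain adds or multiplies.
bulldozer_without_fma = flops_per_cycle(pipes=2, vector_bits=128)

# Sandy Bridge core: one 256-bit add pipe plus one 256-bit multiply pipe.
sandy_bridge = flops_per_cycle(pipes=1, vector_bits=256) + flops_per_cycle(
    pipes=1, vector_bits=256
)
```

Both designs peak at the same number of operations per cycle, but Sandy Bridge only gets there with an even mix of adds and multiplies, while Bulldozer only gets there when the multiply and add can be fused.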
Bulldozer’s FMA latency is rather high at 6 cycles, probably because it’s AMD’s first attempt at an FMA implementation. Targeting high clock speeds on a disappointing process node probably didn’t help matters either. But in fairness to Bulldozer, Haswell enjoyed a much better process node with Intel’s 22 nm and was only one cycle better with 5 cycle FMA latency.
Context matters too. Software takes time to adopt a new ISA extension, so AVX usage was very rare right after Bulldozer and Sandy Bridge’s launches. Bulldozer’s FPU setup makes a lot of sense for existing software that only uses scalar or 128-bit vector instructions. For example, Bulldozer’s FMA units are also used to handle adds and multiplies, meaning that two ports are available for both operations. Sandy Bridge only has one FP add port and one FP multiply port, making it prone to port bottlenecks if FP code doesn’t have a roughly even mix of adds and multiplies.
Overall, AMD’s FPU is a large improvement over the one present in K10. Throughput is increased by a more flexible pipe layout, where two pipes can handle both FP adds and multiplies. For low-threaded performance, the FPU can give each thread massive resources for FPU reordering, such that reordering limits are likely to be hit elsewhere. Even with two threads active, the FPU can give each thread pretty decent reordering capacity. The greatest weakness in Bulldozer’s FPU is probably its execution latency. But FP code tends to be less latency sensitive, and the FPU has plenty of scheduling capacity to absorb short term spikes. Scheduler capacity will of course be shared if two threads are using the FPU, but in that case, execution latency is mitigated by the explicit parallelism provided by a second thread.
Load/Store
Each Bulldozer thread executes memory operations through a pair of AGU pipes. Like the ones in prior AMD CPUs, these AGUs are relatively powerful and can handle indexed addressing with no penalty. The load/store unit tracks in-flight memory operations using separate queues for loads and stores. The load queue has 40 entries, and the store queue has 24 entries. Bulldozer’s approach has more in common with Sandy Bridge, which also has separate load and store queues (just larger ones than Bulldozer). In K10, a single unified queue handled both loads and stores. Splitting up the load and store queues is probably an acknowledgement that K10’s scheme was inefficient. Loads are far more common than stores, and tracking stores requires more storage. That’s because pending store data also has to be kept around, in addition to the store address.
Once addresses have been generated, the load/store unit checks to make sure memory dependencies are satisfied. That is, an in-flight load may need to get its data forwarded from an older store. Bulldozer’s mechanism for doing this is an improvement over K10. It no longer suffers from false dependencies when the load and store both touch the same 4B aligned region, but don’t actually overlap. With scalar integer memory accesses, Bulldozer can do fast forwarding in all cases where the load and store addresses match exactly. K10’s forwarding mechanism would fail if either memory access crossed a 16 byte boundary. But this advantage isn’t so clear cut, because K10 generally has lower latencies. If the load is misaligned, store forwarding takes 13-14 cycles on Bulldozer. That’s much better than the 35-39 cycle failure case, but K10’s failure case takes 12-13 cycles. Bulldozer only has an advantage if the store is misaligned, but the load isn’t. There, K10 takes a 10 cycle penalty, while Bulldozer can handle it in 8 cycles (which is the same as the happy path forwarding latency). In cases where both architectures can pull off fast-path forwarding, K10 is generally faster with 4-5 cycle forwarding latency.
Compared to K10, Bulldozer has more robust checks, but also suffers higher latencies, especially if those checks fail. K10 is much simpler and has to hit a slow path more often, but has a short pipeline and recovers relatively quickly. Failed store forwarding on Bulldozer generally incurs a hefty 35 cycle penalty. This increases to 39 cycles if the load is misaligned, and can reach 42-43 cycles if the load crosses a 64B cacheline boundary. K10’s worst penalty is 12-13 cycles when both the load and store are misaligned. Elsewhere, the failure case is generally 10-11 cycles.
In cases where loads are independent and no forwarding is needed, Bulldozer’s more robust load/store unit can make it faster too. On K10, misaligned accesses carry a significant penalty, with a pair of misaligned loads and stores completing once every 3-4 cycles. On Bulldozer, the data cache probably has three separate ports (two read and one write), giving it more bandwidth to handle misaligned accesses. However, Bulldozer does take a slight penalty with misaligned loads that come within a 4B aligned 32B sector accessed by a store. There’s probably some kind of coarse, fast path check in Bulldozer, with a more thorough check for forwarding happening if the initial check indicates a possible overlap.
Sandy Bridge has a similar coarse, fast check. But Intel does this on aligned 4 byte boundaries, making it a finer check than Bulldozer’s. Across the board, Sandy Bridge’s load/store unit is far more flexible and robust. It can do fast-path store forwarding for all cases where the load is contained within the store, even with misaligned accesses. As a cherry on top, Intel generally enjoys lower penalties too. Compared to Bulldozer, the fast forwarding case is faster (6 cycles vs 8), and the failure case is less punishing (17-25 cycles, vs 36-42 cycles).
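To make the difference between the two fast-path conditions concrete, here is a minimal Python sketch. It is a toy model, not a description of either core's actual logic. The function names are made up for illustration, and treating "same start address, load no wider than the store" as Bulldozer's only fast-path case is a simplifying assumption: the measurements above only show that exact matches always forward quickly.

```python
def overlaps(st_addr, st_size, ld_addr, ld_size):
    """True if the load reads at least one byte written by the store."""
    return ld_addr < st_addr + st_size and st_addr < ld_addr + ld_size


def exact_match_rule(st_addr, st_size, ld_addr, ld_size):
    """Hypothetical stand-in for Bulldozer's fast path: addresses match exactly.

    The size condition (load no wider than the store) is an assumption.
    """
    return ld_addr == st_addr and ld_size <= st_size


def containment_rule(st_addr, st_size, ld_addr, ld_size):
    """Hypothetical stand-in for Sandy Bridge's fast path: load inside the store."""
    return st_addr <= ld_addr and ld_addr + ld_size <= st_addr + st_size


def classify(fast_path_rule, st_addr, st_size, ld_addr, ld_size):
    """Label a store/load pair under the given fast-path rule."""
    if not overlaps(st_addr, st_size, ld_addr, ld_size):
        return "independent"
    if fast_path_rule(st_addr, st_size, ld_addr, ld_size):
        return "fast forward"
    return "forwarding fails"


if __name__ == "__main__":
    store = (0x1000, 8)   # 8-byte store
    load = (0x1004, 4)    # 4-byte load from the upper half of that store
    for name, rule in (("exact match", exact_match_rule),
                       ("containment", containment_rule)):
        print(name, "->", classify(rule, *store, *load))
```

Under these assumptions, a load from the upper half of an earlier store qualifies for the fast path only with the containment rule, which is the kind of case where Sandy Bridge's more flexible check pays off.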
Intel also suffers less from misaligned access penalties, and only has to do extra L1D accesses if a load or store crosses a 64B cacheline boundary. Like K10, Bulldozer’s L1D generally handles operations in 16B aligned chunks, making it more prone to misaligned access penalties.
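A quick way to see why the access granularity matters is to count how many aligned chunks an access touches. The helper below is a sketch with a hypothetical name (`chunks_touched`). It only does the address arithmetic and says nothing about how either L1D is actually built.

```python
def chunks_touched(addr, size, chunk):
    """Number of chunk-aligned blocks covered by the bytes [addr, addr + size)."""
    first = addr // chunk
    last = (addr + size - 1) // chunk
    return last - first + 1


if __name__ == "__main__":
    # An 8-byte access at 0x100C straddles a 16B boundary,
    # but stays inside a single 64B cacheline.
    addr, size = 0x100C, 8
    print("16B chunks:", chunks_touched(addr, size, 16))
    print("64B lines: ", chunks_touched(addr, size, 64))
```

Any misaligned access that crosses a 64B line also crosses a 16B boundary, but not the other way around, so a 16B granularity exposes more accesses to the split penalty.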
However, Bulldozer still handles a few corner cases better. Like with K10, there’s no extra penalty if a load crosses a 4K page boundary, while Sandy Bridge takes a 35 cycle penalty. The situation flips with stores, where Sandy Bridge can handle split-page writes with a 25 cycle penalty. Bulldozer takes 37-38 cycles to do so. That represents a large regression from K10, which didn’t suffer any notable 4K page crossing penalties. I’m guessing that AMD’s TLBs are able to read out two entries in a single cycle, while Intel’s can’t. However, Bulldozer’s write-through L1D probably throws a wrench in the works when dealing with stores.
4K Aliasing
Bulldozer and Sandy Bridge both only check a subset of address bits when initially determining whether memory accesses are independent, and higher address bits are not included. Specifically, they don’t check more than the first 12 bits. That makes sense, because you don’t even know what the higher bits are until you’ve finished a TLB lookup. After all, there’s nothing wrong with aliasing multiple pages in virtual address space to a single page in physical memory (though doing so may make you a terrible person).
Both cores therefore lose ILP from false dependencies when a load and store are spaced by 4096 bytes, even though the accesses don’t overlap. But Bulldozer suffers far higher penalties when this happens, generally taking 16 cycles to sort itself out and realize there’s actually no dependency. The penalty can go as high as 27 cycles if the accesses are misaligned. This isn’t as bad as the failed store forwarding penalty, but still high enough to suggest that Bulldozer’s not doing a full address comparison until quite late in the load/store pipeline.
In contrast, Sandy Bridge seems to figure out that everything is actually fine fairly early in its load/store pipeline, typically resulting in a 3-4 cycle penalty for the load. If both accesses are misaligned, this penalty increases to 8-9 cycles, but that’s still better than on Bulldozer.
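The partial address check can be illustrated with a few lines of Python. This is a sketch of the idea only: the function names are hypothetical, the 12-bit mask comes from the description above, and accesses that wrap around the end of a page are ignored for simplicity.

```python
PAGE_OFFSET_BITS = 12
PAGE_OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1  # low 12 bits, i.e. offset within a 4K page


def ranges_overlap(a_addr, a_size, b_addr, b_size):
    """True if the byte ranges [a, a + size) and [b, b + size) intersect."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


def partial_check_hits(st_addr, st_size, ld_addr, ld_size):
    """Early check that only looks at the low 12 address bits."""
    return ranges_overlap(st_addr & PAGE_OFFSET_MASK, st_size,
                          ld_addr & PAGE_OFFSET_MASK, ld_size)


def is_false_dependency(st_addr, st_size, ld_addr, ld_size):
    """The early check flags the pair, but the full addresses do not overlap."""
    return (partial_check_hits(st_addr, st_size, ld_addr, ld_size)
            and not ranges_overlap(st_addr, st_size, ld_addr, ld_size))


if __name__ == "__main__":
    store_addr = 0x10000
    for load_addr in (store_addr, store_addr + 4096, store_addr + 4096 + 64):
        print(hex(load_addr), is_false_dependency(store_addr, 8, load_addr, 8))
```

A load exactly 4096 bytes away from the store looks dependent to the early check even though the full addresses differ. That is the situation in which both cores have to pay for a later, fuller comparison.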
Brief Words on Part 1
AMD has implemented a modern execution engine in the Bulldozer core. Compared to Athlon, the Bulldozer foundation gives AMD’s engineers a lot of flexibility to allocate resources. Bulldozer addresses a lot of Athlon’s traditional bottlenecks, like low integer scheduling capacity.
Not all of the modernization efforts can show their full potential. For example, improvements in the branch predictor and load/store unit are blunted by higher latency. A focus on multithreaded performance meant sacrifices to the reordering capacity and integer execution resources available to a single thread. But from an architectural “bling” perspective, Bulldozer undeniably has more in common with today’s architectures than the likes of Athlon or P6.
At the same time, AMD brought some of Athlon’s strengths forward. Bulldozer’s frontend retains a high capacity 64 KB L1i with predecode information. But the core is only part of the story. Out of order execution aims to keep the execution units fed in the face of cache and memory latency, and we’ll look at what the core has to deal with in Part 2.