The 8086 processor’s microcode pipeline from die analysis

2023-01-10 12:26:29

Intel introduced the 8086 microprocessor in 1978, and its influence still remains through the popular x86 architecture.
The 8086 was a fairly complex microprocessor for its time, implementing instructions in microcode with pipelining to improve performance.
This blog post explains the microcode operations for a particular instruction, “ADD immediate”.
As the 8086 documentation will tell you, this instruction takes four clock cycles to execute.
But looking internally shows seven clock cycles of activity.
How does the 8086 fit seven cycles of computation into four cycles? As I’ll show, the trick is pipelining.

The die photo below shows the 8086 microprocessor under a microscope.
The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to
the chip’s 40 external pins.
Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top
and an Execution Unit (EU) below, which will be important in the discussion.
The Bus Interface Unit handles memory accesses (including instruction prefetching), while the Execution Unit executes instructions.
The functional blocks labeled in black are the ones that are part of the discussion below.
In particular, the registers and ALU (Arithmetic/Logic Unit) are at the left and the large microcode ROM is in the lower-right.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.


Microcode for “ADD”

Most people think of machine instructions as the basic steps that a computer performs.
However, many processors (including the 8086) have another layer of software underneath: microcode.
The motivation is that instructions usually require multiple steps inside the processor.
One of the hardest parts of computer design is creating the control logic that directs the processor for each step of an instruction.
The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals.
However, this circuitry is complicated, error-prone, and hard to design.

The alternative is microcode: instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code.
To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.
In other words, microcode forms another layer between the machine instructions and the hardware.
The main advantage of microcode is that it turns the processor’s control logic into a programming task instead of a difficult logic design task.

The 8086 uses a hybrid approach: although the 8086 uses microcode, much of the instruction functionality is implemented with gate logic.
This approach removed duplication from the microcode and kept the microcode small enough for 1978 technology.
In a sense the microcode is parameterized.
For instance, the microcode can specify a generic ALU operation, and the gate logic determines from the instruction which ALU operation to perform.
Likewise, the microcode can specify a generic register and the gate logic determines which register to use.
The simplest instructions (such as prefixes or condition-code operations) don’t use microcode at all.
Although this made the 8086’s gate logic more complicated, the tradeoff was worthwhile.

The 8086’s microcode was disassembled by Andrew Jenner (link) from my die photos, so we can see exactly what micro-instructions the 8086 is running for each machine instruction.
In this post, I’ll focus on the ADD instruction, since it is fairly straightforward.
In particular, the “ADD AX, immediate” instruction contains a 16-bit value that is added to the value in the 16-bit AX register.
This instruction consists of three bytes: the opcode 05, followed by the two-byte immediate value.
(An “immediate” value is included in the instruction, rather than coming from a register or memory location.)

This ADD instruction is implemented in the 8086’s microcode as four micro-instructions, shown below.
Each micro-instruction specifies a move operation across the internal ALU bus. It also specifies an action.
In brief, the first two instructions get the immediate argument from the prefetch queue.
The third instruction gets the argument from the AX register and starts the ALU (Arithmetic/Logic Unit) operation.
The final instruction stores the result into the AX register and updates the condition flags.

µ-address    move          action
   018    Q → tmpBL     L8    2
   019    Q → tmpBH
   01a    M → tmpA      XI    tmpA, NXT
   01b    Σ → M         RNI   FLAGS

In detail, the first instruction moves a byte from the prefetch queue (Q) to one of the ALU’s temporary registers, specifically the low byte of the tmpB register.
(The ALU has three temporary registers to hold arguments: tmpA, tmpB, and tmpC.
These temporary registers are invisible to the programmer and are unrelated to the AX, BX, CX registers.)
Likewise, the second instruction fetches the high byte of the immediate value from the queue and stores it in the high byte of the ALU’s tmpB register.
The action in the first micro-instruction, L8, will branch to step 2 (01a) if the instruction specifies an 8-bit operation, skipping the load of the high byte.
Thus, the same microcode supports the 8-bit and 16-bit ADD instructions.1

The third micro-instruction is more complicated. The move section moves the AX register’s contents (indicated by M) to the ALU’s tmpA register, getting both arguments ready for
the operation. XI tmpA starts an ALU operation, in this case adding tmpA to tmpB.2
Finally, NXT indicates that this is the next-to-last micro-instruction, as will be discussed below.

The last micro-instruction stores the ALU’s result (Σ) into the AX register.
The end of the microcode for this machine instruction is indicated by RNI (Run Next Instruction). Finally, FLAGS causes the 8086’s condition flags register to be updated,
indicating if the result is zero, negative, and so forth.
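
To make the four steps concrete, here is a minimal Python sketch that walks micro-addresses 018 through 01b as described above. It is only a functional model under stated assumptions: the function name, the flag names, and the `trace` list are hypothetical, the ALU operation is hard-wired to addition (the real XI is generic), and pipelining and timing are ignored entirely.

```python
def run_alu_immediate(m_value, queue_bytes, byte_op=False):
    """Walk micro-addresses 018-01b for ADD with an immediate operand.

    m_value: contents of the register selected by M (AX in the example).
    queue_bytes: immediate bytes waiting in the prefetch queue, low byte first.
    byte_op: True models the 8-bit form, where L8 skips the high-byte load.
    """
    queue = list(queue_bytes)
    mask = 0xFF if byte_op else 0xFFFF
    sign_bit = 0x80 if byte_op else 0x8000
    tmp_a = tmp_b = sigma = 0
    carry = False
    flags = {}
    trace = []          # micro-addresses visited, for illustration only
    addr = 0x018
    while addr is not None:
        trace.append(addr)
        if addr == 0x018:
            # Q -> tmpBL; L8 branches to 01a for an 8-bit operation
            tmp_b = (tmp_b & 0xFF00) | queue.pop(0)
            addr = 0x01A if byte_op else 0x019
        elif addr == 0x019:
            # Q -> tmpBH
            tmp_b = (tmp_b & 0x00FF) | (queue.pop(0) << 8)
            addr = 0x01A
        elif addr == 0x01A:
            # M -> tmpA; XI tmpA starts the ALU operation (ADD assumed here)
            tmp_a = m_value
            total = tmp_a + tmp_b
            sigma = total & mask
            carry = total > mask
            addr = 0x01B
        else:
            # Sigma -> M; FLAGS updates the condition flags; RNI ends the sequence
            m_value = sigma
            flags = {"zero": sigma == 0,
                     "sign": bool(sigma & sign_bit),
                     "carry": carry}
            addr = None
    return m_value, flags, trace
```

Calling `run_alu_immediate(0x1234, [0x34, 0x12])` visits all four micro-addresses, while passing `byte_op=True` with a single queue byte skips 019, mirroring the L8 branch.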

You may have noticed that the microcode doesn’t explicitly specify the ADD operation or the AX register, using XI and M instead.
This illustrates the “parameterized” microcode mentioned earlier.
The microcode specifies a generic ALU operation with XI,3 and the hardware fills in the particular ALU operation from bits 5-3 of the machine instruction.
Thus, the microcode above can be used for addition, subtraction, exclusive-or, comparisons, and four other arithmetic/logic operations.
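
As a rough illustration of how bits 5-3 select the operation, the sketch below extracts that field from an opcode byte. The function and table names are hypothetical, and the ordering of the eight operations comes from the standard 8086 opcode map rather than from this post, so treat it as an assumption of the example.

```python
# Eight ALU operations in the order used by the standard 8086 opcode map
# (an assumption of this sketch; the post names only four of them).
ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

def alu_op_from_opcode(opcode):
    """Return the ALU operation selected by bits 5-3 of the opcode byte."""
    return ALU_OPS[(opcode >> 3) & 0b111]

# Opcode 05 ("ADD AX, immediate") has bits 5-3 equal to 000.
print(alu_op_from_opcode(0x05))
```

The same lookup applied to 0x2D, 0x35, and 0x3D gives SUB, XOR, and CMP, which is why one microcode routine can serve all of these instructions.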

The other parameterized aspect is the generic M register specification.
The 8086’s instruction set has a flexible way of specifying registers for the source and destination of an operation:
registers are often specified by a “Mod R/M” byte, but can also be specified by bits in the first opcode.
Moreover, many instructions have a bit to switch the source and destination, and another bit to specify an 8-bit or 16-bit register.
The microcode can ignore all this; a micro-instruction uses M and N for the source and destination registers, and the hardware handles the details.4
The M and N values are implemented by 5-bit registers that are invisible to the programmer and specify the “real” register to use. The diagram below shows how they appear on the die.

Die photo of the circuitry that implements the M and N registers. A multiplexer selects a source for the N register value and feeds it into the 5-bit N register. The M register is similar. Between the two registers is a "swap" circuit to swap the outputs of the two registers based on the instruction's "direction" bit. In this image, the metal layer has been dissolved with acid to show the transistors in the silicon layer underneath.



The 8086 documentation says this ADD instruction takes four clock cycles, and as we have seen, it is implemented with four micro-instructions.
One micro-instruction is executed per clock cycle, so the timing seems straightforward.
The problem, however, is that a micro-instruction can’t be completed in one clock cycle.
It takes a clock cycle to read a micro-instruction from the microcode ROM.
Sending signals across an internal bus typically takes a clock cycle and other actions take more time.
So a typical micro-instruction ends up taking 2½ clock cycles from start to end.
One solution would be to slow down the clock, so the micro-instruction can complete in one cycle, but that would drastically reduce performance.
A better solution is pipelining the execution so a micro-instruction can complete every cycle.5

The idea of pipelining is to break instruction processing into “stages”, so different stages can work on different instructions at the same time.
It’s kind of like an assembly line, where a particular car might take an hour to manufacture, but a new car comes off the assembly line every minute.
The diagram below shows a simple example. Suppose executing an instruction requires three steps: A, B, and C.
Executing four instructions, as shown at the top, would take 12 steps in total.

Diagram of a simple pipeline showing four instructions executing through three stages.


However, suppose the steps can execute independently, so step B for one instruction can execute at the same time as step A for another instruction.
Now, as soon as instruction 1 finishes step A and moves on to step B, instruction 2 can start step A.
Next, instruction 3 starts step A as instructions 2 and 1 move to steps B and C respectively.
The first instruction still takes 3 time units to complete, but after that, an instruction completes every time unit, providing a theoretical 3× speedup.6
In a bit, I’ll show how the 8086 uses the idea of pipelining.
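
The arithmetic behind the three-stage example can be checked with a small sketch. The function names are made up for illustration, and the model assumes an ideal pipeline with one time unit per stage and no stalls.

```python
def sequential_steps(instructions, stages):
    """Time units when each instruction finishes before the next one starts."""
    return instructions * stages

def pipelined_steps(instructions, stages):
    """Time units for an ideal pipeline: fill once, then one completion per unit."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)

# Four instructions through stages A, B, and C.
print(sequential_steps(4, 3), pipelined_steps(4, 3))

# The speedup only approaches 3x as the instruction count grows.
print(sequential_steps(1000, 3) / pipelined_steps(1000, 3))
```

With only four instructions the pipeline takes 6 time units instead of 12, a 2× gain; the "theoretical 3×" figure is the limit for a long instruction stream.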

The prefetch queue

The 8086 uses instruction prefetching to improve performance.
Prefetching is not the focus of this article, but a brief explanation is necessary. (I wrote about the prefetch circuitry in detail earlier.)
Memory accesses on the 8086 are relatively slow (at least four clock cycles), so we don’t want to wait every time the processor needs a new instruction.
The idea behind prefetching is that the processor fetches future instructions from memory while the CPU is busy with the current instruction.
When the CPU is ready to execute the next instruction, hopefully the instruction is already in the prefetch queue and the CPU doesn’t need to wait for memory.
The 8086 appears to be the first microprocessor to implement prefetching.

In more detail, the 8086 fetches instructions into its prefetch queue asynchronously from instruction execution: The “Bus Interface Unit” performs prefetches, while
the “Execution Unit” executes instructions.
Prefetched instructions are stored in the 6-byte prefetch queue.
The Q bus (short for “Queue bus”) provides bytes, one at a time, from the prefetch queue to the Execution Unit.7
If the prefetch queue doesn’t have a byte available when the Execution Unit needs one, the Execution Unit
waits until the prefetch circuitry can complete a memory access.
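
The queue behavior described above can be sketched as a small first-in, first-out buffer. This is a toy model, not the actual circuit: the class and method names are hypothetical, and details such as how many bytes the Bus Interface Unit fetches per memory access are deliberately left out.

```python
from collections import deque

class PrefetchQueue:
    """Toy model of the 6-byte queue between the BIU and the EU."""
    CAPACITY = 6

    def __init__(self):
        self._bytes = deque()

    def has_room(self):
        """True if the Bus Interface Unit could store another prefetched byte."""
        return len(self._bytes) < self.CAPACITY

    def push(self, value):
        """Bus Interface Unit side: store one prefetched byte."""
        if not self.has_room():
            raise OverflowError("prefetch queue is full")
        self._bytes.append(value & 0xFF)

    def pop(self):
        """Execution Unit side: next byte over the Q bus, or None if it must wait."""
        return self._bytes.popleft() if self._bytes else None

queue = PrefetchQueue()
for byte in (0x05, 0x34, 0x12):   # opcode 05 followed by a two-byte immediate
    queue.push(byte)
print(hex(queue.pop()))
```

Returning `None` from `pop` stands in for the Execution Unit stalling until a memory access completes.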

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the prefetch queue, but this is not entirely straightforward.
The main problem is that the prefetch queue can be empty, blocking execution.
Second, instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction
is finished.
A circuit called the “loader” solves these problems by
using a small state machine (below) to efficiently fetch bytes from the queue at the right time.

The state machine for the 8086 "loader" circuit. I'm not going to explain how it works in this post, but the diagram looks pretty cool.
From patent US4449184.


The loader generates two timing signals that synchronize instruction decoding and microcode execution with the prefetch queue.
The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second
instruction byte.
Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the first byte could be the last one in the queue,
delaying the Second Clock.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction.
However, microcode execution would be blocked for a cycle due to the delay of fetching and decoding the next instruction.
In many cases, this can be avoided: if the microcode knows that it is one micro-instruction away from finishing,
it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction before the previous instruction finishes.
As will be shown in the next section,
this usually allows micro-instructions to run without interruption.

Instruction execution

Putting this all together, we can see how the ADD instruction is executed, cycle by cycle.
Each clock cycle starts with the clock high (H) and ends with the clock low (L).8
The sequence starts with the prefetch queue supplying the ADD instruction across the Q bus in cycle 1.
The loader indicates that this is First Clock and the instruction is loaded into the microcode address register.
It takes a clock cycle for the address to exit the address register (as indicated by an arrow) along with the microcode counter value indicating step 0.
To remember the ALU operation, bits 5-3 of the instruction are saved in the internal X register (unrelated to the AX register).

In cycle 2, the prefetch queue has supplied the second byte of the instruction so the loader indicates Second Clock.
In the second half of cycle 2, the microcode address decoder has converted the instruction-based address to the micro-address 018 and supplies it to the microcode ROM.


In cycle 3, the microcode ROM outputs the micro-instruction at micro-address 018: Q→tmpBL, which will move a byte from the prefetch queue bus (Q bus) to the low byte of the ALU temporary B register, as described earlier.
It takes a full clock cycle for this action to take place, as the byte traverses buses to reach the register.
This micro-instruction also generates the L8 micro-op, which will branch if an 8-bit operation is taking place.
As this is a 16-bit operation, no branch takes place.9
Meanwhile, the microcode address register moves to step 1, causing the decoder to produce the micro-address 019.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".


In cycle 4,
the prefetch queue provides a new byte, the high byte of the immediate value.
The microcode ROM outputs the micro-instruction at micro-address 019: Q→tmpBH, which will move this byte from the prefetch queue bus to the high byte of the ALU temporary B register.
As before, it takes a full cycle for this move to complete.
Meanwhile, the microcode address register moves to step 2, causing the decoder to produce the micro-address 01a.

In cycle 5,
the microcode ROM outputs the micro-instruction at micro-address 01a: M→tmpA,XI tmpA,NXT.
Since the M (source) register specifies AX, the contents of the AX register will be moved into the ALU tmpA register, but this will take a cycle to complete.
The XI tmpA part starts decoding the ALU operation saved in the X register, in this case ADD.
Finally, NXT indicates that the next micro-instruction is the last one in this instruction.
Along with the next instruction on the Q bus, this causes the loader to issue First Clock. This starts execution of the next machine instruction, even though the current instruction is still executing.

In cycle 6,
the microcode ROM outputs the micro-instruction at micro-address 01b: Σ→M,RNI.
This will store the ALU output into the register indicated by M (i.e. AX), but not yet.
In the first half of cycle 6, the ALU decoder determines the ALU control signals that will cause an ADD to take place.
In the second half of cycle 6, the ALU receives these control signals and computes the sum.
The RNI (Run Next Instruction) and the second instruction byte from the prefetch queue cause the loader to issue Second Clock, and the micro-address for
the next machine instruction is sent to the microcode ROM.

Finally, in cycle 7, the sum is written to the AX register and the flags are updated, completing the ADD instruction.
Meanwhile, the next instruction is well underway with its first micro-instruction being executed.

As you can see, execution of a micro-instruction is pipelined, with three full clock cycles from the arrival of an instruction until the first
micro-instruction completes in cycle 4.
Although this system is complex, in the best case it achieves the goal of running a micro-instruction each cycle, without gaps.
(There are gaps in some cases, most commonly when the prefetch queue is empty.
A gap will also occur if the microcode control flow doesn’t allow a NXT micro-instruction to be issued.
In that case, the loader can’t issue First Clock until the RNI micro-instruction is issued, resulting in a delay.)
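
The cycle numbers in this walkthrough can be reproduced with a short scheduling sketch. It is a simplified model built only from the description above: the function name and parameters are hypothetical, the prefetch queue is assumed never to be empty, and every instruction is assumed to have the same number of micro-instructions.

```python
def schedule(num_instructions, micro_ops=4, uses_nxt=True):
    """Return (first_clock_cycle, final_cycle) for back-to-back instructions.

    Assumes the first micro-instruction leaves the ROM two cycles after
    First Clock and the result is written one cycle after the last one (RNI).
    """
    timeline = []
    first_clock = 1
    for _ in range(num_instructions):
        first_rom_cycle = first_clock + 2
        last_rom_cycle = first_rom_cycle + micro_ops - 1   # cycle holding RNI
        final_cycle = last_rom_cycle + 1                   # register write-back
        timeline.append((first_clock, final_cycle))
        # With NXT, the loader issues the next First Clock one cycle earlier,
        # in the cycle of the next-to-last micro-instruction.
        first_clock = last_rom_cycle - 1 if uses_nxt else last_rom_cycle
    return timeline

print(schedule(3))
print(schedule(3, uses_nxt=False))
```

In this model each ADD spans seven cycles, yet with NXT a new one starts every four cycles, matching the documented timing; without NXT the spacing grows to five cycles, which is the one-cycle gap mentioned above.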


The 8086 uses multiple types of pipelining to increase performance. I’ve focused on the pipelining at the microcode level, but the 8086 uses at least four interlocking
types of pipelining.
First, microcode pipelining allows micro-instructions to complete at the rate of one per clock cycle, even though it takes multiple cycles for a micro-instruction to
complete.
Admittedly, this pipeline is not very deep compared to the pipelines in RISC processors; the 8086 designers called the overlap in the microcode ROM a “sort of mini-pipeline.”10

The second type of pipelining overlaps instruction decoding and execution. Instruction decoding is fairly complicated on the 8086 since there are many different
formats of instructions, usually depending on the second byte (Mod R/M).
The loader coordinates this pipelining, issuing the First Clock and Second Clock signals so decoding on the next instruction can start before the previous instruction has completed.
Third is the prefetch queue, which overlaps fetching instructions from memory with execution.
This is accomplished by partitioning the processor into the Bus Interface Unit and the Execution Unit, with the prefetch queue in between.
(I recently wrote about instruction prefetching in detail.)

There’s a final type of pipelining that I haven’t discussed. Inside the memory access sequence, computing the memory address from a segment register and offset is overlapped with the previous
memory access. The result is that memory accesses appear to take four cycles, even though they really take six cycles.
I plan to write more about memory access in a later post.

The 8086 was a large advance in size, performance, and architecture compared to earlier microprocessors such as the Z80 (1976), 8085 (1977), and 6809 (1978). Besides moving to 16 bits, the 8086 had
a considerably more complex architecture with instruction prefetching and microcode, among other features.
At the same time, the 8086 avoided the architectural overreach of Intel’s ill-fated iAPX 432, a complex processor that supported garbage collection and objects in hardware.
Although the 8086’s architecture had flaws, it was a success and led to the x86 architecture, still dominant today.

I plan to continue reverse-engineering the 8086 die so
follow me on Twitter @kenshirriff or RSS for updates.
I’ve also started experimenting with Mastodon recently as @[email protected].
If you’re interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references
