How the 8086 processor determines the size of an instruction
The Intel 8086 processor (1978) has a complicated instruction set with instructions ranging from one to six bytes long.
This raises the question of how the processor knows the length of an instruction.1
The answer is that the 8086 uses an interesting combination of lookup ROMs and microcode to determine how many bytes to use for an instruction.
In brief, the ROMs perform enough decoding to determine whether the instruction needs one byte or two.
After that, the microcode simply consumes instruction bytes as it needs them.
Thus, nothing in the chip explicitly “knows” the length of an instruction.
This blog post describes this process in more detail.
The die photo below shows the chip under a microscope.
I’ve labeled the key functional blocks; the ones that are important to this post are darker.
Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below.
The BIU handles bus and memory activity as well as instruction prefetching, while the Execution Unit (EU) executes the instructions.
The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath.
The prefetch queue, the loader, and the microcode
The 8086 uses a 6-byte instruction prefetch queue to hold instructions, and this queue will play an important role in this discussion.3
Earlier microprocessors read instructions from memory as they were needed, which could cause the CPU to wait on memory.
The 8086, instead, read instructions from memory before they were needed, storing them in the instruction prefetch queue.
(You can think of this as a primitive instruction cache.)
To execute an instruction, the 8086 took bytes out of the queue one at a time. If the queue ran empty, the processor waited until more
instruction bytes were fetched from memory into the queue.
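The queue behavior described above can be sketched as a small first-in, first-out buffer. This is a toy model under stated assumptions, not the 8086's actual circuitry: the class name `PrefetchQueue` and its methods are hypothetical, and it fetches one byte per call for simplicity.

```python
from collections import deque


class PrefetchQueue:
    """Toy model of a 6-byte instruction prefetch queue (illustrative only)."""

    CAPACITY = 6  # queue size stated in the text

    def __init__(self, memory, start=0):
        self.memory = memory      # bytes-like object standing in for RAM
        self.fetch_addr = start   # next address the bus side will fetch
        self.queue = deque()

    def prefetch(self):
        """Fetch one byte ahead of execution if there is room and memory remains."""
        if len(self.queue) < self.CAPACITY and self.fetch_addr < len(self.memory):
            self.queue.append(self.memory[self.fetch_addr])
            self.fetch_addr += 1
            return True
        return False

    def take(self):
        """Return the next instruction byte, or None if execution would have to wait."""
        return self.queue.popleft() if self.queue else None
```

In this sketch, a `None` from `take()` stands in for the processor waiting until more bytes arrive; the execution side never asks how long the instruction is, only for the next byte.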
A circuit called the loader handles the interaction between the prefetch queue and instruction execution.
The loader is a small state machine that provides control signals to the rest of the execution circuitry.
The loader gets the first byte of an instruction from the prefetch queue and issues a signal FC (First Clock) that starts execution of the instruction.
At this point, the Group Decode ROM performs the first stage of instruction decoding, classifying the instruction into various categories based
on the opcode byte.
Most of the 8086’s instructions are implemented in microcode.
However, a few instructions are so simple that they are implemented with logic circuits. For example, the CLC (Clear Carry) instruction clears the carry flag directly.
The Group Decode ROM categorizes these instructions as 1BL (one-byte, implemented in logic). The loader responds by issuing an SC (Second Clock)
signal to wrap up execution and start the next instruction. Thus, these simple instructions take two clock cycles.
The 8086 has various prefix bytes that can be put in front of an instruction to change its behavior.
For instance, a segment prefix changes the memory segment that the instruction uses. A LOCK prefix locks the bus during the next
instruction. The Group Decode ROM detects a prefix and outputs a prefix signal. This causes the prefix to be handled in logic,
rather than microcode, similar to the 1BL instructions.
Thus, a prefix also takes one byte and two clock cycles.
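As a rough illustration of this first-stage classification, the sketch below maps an opcode byte to one of the categories named in this post. The opcode sets are small illustrative subsets taken from the standard 8086 encoding, not the contents of the real Group Decode ROM, and the names `classify` and `loader_bytes` are hypothetical.

```python
# Illustrative subsets only; the real Group Decode ROM covers every opcode.
PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0xF0, 0xF2, 0xF3}  # segment overrides, LOCK, REPNE, REP
ONE_BYTE_LOGIC = {0xF8}                                # CLC
TWO_BYTE_ROM = {0x00, 0x01, 0x02, 0x03}                # ADD forms that take a ModR/M byte


def classify(opcode):
    """Return a category label for the first byte of an instruction."""
    if opcode in PREFIXES:
        return "prefix"
    if opcode in ONE_BYTE_LOGIC:
        return "1BL"
    if opcode in TWO_BYTE_ROM:
        return "2BR"
    return "microcode"  # microcode can start after one byte


def loader_bytes(opcode):
    """Bytes the loader takes from the queue before handing off: two for 2BR, else one."""
    return 2 if classify(opcode) == "2BR" else 1
```

For example, `classify(0xF8)` yields `"1BL"` for CLC, while `loader_bytes(0x01)` yields 2 because that ADD form carries a ModR/M byte. Anything beyond those one or two bytes is left to the microcode.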
The remaining instructions are handled by microcode.2
Let’s start with a one-byte instruction such as INC AX, which increments the AX register.
As before, the loader gets the instruction byte from the prefetch queue.
The Group Decode ROM determines that this instruction is implemented in microcode and can start after one byte, so the microcode engine starts
running.
The microcode below handles the increment and decrement instructions. It moves the appropriate register, indicated by M, to the ALU’s temporary B register.
It puts the incremented or decremented result (Σ) back into the register (M). RNI tells the loader to run the next instruction.
With two micro-instructions, this instruction takes two clock cycles.
M → tmpB     XI tmpB, NX    INC/DEC: get value from M, set up ALU
Σ → M        WB,RNI  F      put result in M, run next instruction
But what happens with an instruction that is more than one byte long, such as adding an immediate value to a register?
Let’s look at ADD AX,1234, which adds 1234 to the AX register.
As before, the loader reads one byte and then the microcode engine starts running.
At this point, the 8086 doesn’t “know” that this is a 3-byte instruction.
The first line of the microcode below gets one byte of the immediate operand: Q→tmpBL loads a byte from the instruction prefetch queue into the low byte of the temporary B register.
Similarly, the second line loads the second byte. The next line puts the register value (M) in tmpA. The last line puts
the sum back into the register and runs the next instruction.
Since this instruction takes two bytes from the prefetch queue, it is three bytes long in total.
But nothing explicitly indicates that this instruction is three bytes long.
Q → tmpBL    JMPS L8 2      alu A,i: get byte from queue
Q → tmpBH                   get byte from queue
M → tmpA     XI tmpA, NX    get value from M, set up ALU
Σ → M        WB,RNI  F      put result in M, run next instruction
You can also add a one-byte immediate value to a register, such as ADD AL,12. This uses the same microcode above. However,
in the first line, JMPS L8 is a conditional jump that skips the second micro-instruction if the data length is 8 bits.
Thus, the microcode only consumes one byte from the prefetch queue, making the instruction two bytes long.
In other words, what makes this instruction two bytes instead of three is the bit in the opcode that triggers the conditional jump in the microcode.
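To make the "length falls out of consumption" idea concrete, here is a minimal sketch that mimics the microcode's byte fetches for the two add-immediate-to-accumulator opcodes. It assumes the standard 8086 encodings (0x04 for ADD AL,imm8 and 0x05 for ADD AX,imm16, with the low opcode bit selecting 16-bit data) and treats the immediates as hexadecimal; the function name is hypothetical, and the ALU step is omitted.

```python
def add_accumulator_immediate(queue_bytes):
    """Toy walk through the 'alu A,i' byte fetches for opcodes 0x04 and 0x05.

    Returns (immediate_value, bytes_consumed). The length is never looked up;
    it is simply how many bytes were taken from the queue.
    """
    queue = iter(queue_bytes)
    consumed = 0

    def take():
        nonlocal consumed
        consumed += 1
        return next(queue)

    opcode = take()                 # loader: first byte starts the instruction
    if opcode not in (0x04, 0x05):
        raise ValueError("this sketch only models ADD AL,imm8 and ADD AX,imm16")
    tmp_b = take()                  # Q -> tmpBL
    if opcode & 1:                  # width bit set: 16-bit data, so JMPS L8 is not taken
        tmp_b |= take() << 8        # Q -> tmpBH
    return tmp_b, consumed
```

With the bytes `05 34 12`, the sketch returns `(0x1234, 3)`; with `04 12`, it returns `(0x12, 2)`. The only difference between the two runs is the opcode bit that controls the conditional step.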
The 8086 has another class of instructions, those with a ModR/M byte following the opcode.
The Group Decode ROM classifies these instructions as 2BR (two-byte ROM), indicating that the second byte must be fetched before processing by
the microcode ROM.
For these instructions, the loader fetches the second byte from the prefetch queue before triggering the SC (Second Clock) signal to start microcode execution.
The ModR/M byte indicates the addressing mode that the instruction should use, such as register-to-register or memory-to-register.
The ModR/M byte can change the instruction length by specifying an address displacement of one or two bytes.
A second ROM called the Translation ROM selects the appropriate microcode for the addressing mode.
For example, if the addressing mode includes an address displacement, the microcode below is used:
Q → tmpBL    JMPS MOD1 12    [i]: get byte(s)
Q → tmpBH
Σ → tmpA     BX EAFINISH     12: add displacement
This microcode fetches two displacement bytes from the prefetch queue (Q).
However, if the ModR/M byte specifies a one-byte displacement, the MOD1 condition causes the microcode to jump over the second
fetch.
Thus, this microcode uses one or two additional instruction bytes depending on the value of the ModR/M byte.
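The effect of the ModR/M byte on length can be sketched from the standard 8086 encoding, where the top two bits (mod) and bottom three bits (r/m) select the addressing mode. This is a minimal illustration of the encoding rules, not a model of the Translation ROM or the microcode; the function name is hypothetical.

```python
def displacement_bytes(modrm):
    """Number of displacement bytes that follow a ModR/M byte (standard 8086 encoding)."""
    mod = modrm >> 6        # top two bits
    rm = modrm & 0b111      # bottom three bits
    if mod == 0b01:
        return 1            # one-byte displacement
    if mod == 0b10:
        return 2            # two-byte displacement
    if mod == 0b00 and rm == 0b110:
        return 2            # direct address: a 16-bit offset with no base register
    return 0                # register operand, or memory with no displacement
```

For an instruction with just an opcode and a ModR/M byte (no prefix or immediate), the length is `2 + displacement_bytes(modrm)`. For example, the bytes `01 40 12` form a three-byte instruction because `0x40` has mod = 01.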
To summarize, nothing in the 8086 “knows” how long an instruction is.
The Group Decode ROM makes part of the decision, classifying instructions as a prefix, 1-byte logic, 2-byte ROM, or otherwise, causing the
loader to fetch one or two bytes.
The microcode then consumes instruction bytes as needed.
In the end, the length of an 8086 instruction is determined by how many bytes are taken from the prefetch queue by the time it ends.
Some other systems
It’s interesting to see how other processors deal with instruction length.
For example, RISC processors (Reduced Instruction Set Computers) typically have fixed-length instructions.
For instance, the ARM-1 processor used 32-bit instructions, making instruction decoding very simple.
Early microprocessors such as the MOS Technology 6502 (1975) didn’t use microcode, but were controlled by state machines.
The CPU fetches instruction bytes from memory as needed, as it moves through various execution states.
Thus, as with the 8086, the length of an instruction wasn’t explicit, but was how many bytes it used.
The IBM 1401 computer (1959) took a completely different approach with its variable-length words.
Each character in memory had an associated “word mark” bit, which you can think of as a metadata bit.
Each machine instruction consisted of a variable number of characters with a word mark on the first one.
Thus, the processor could read instruction characters until it hit a word mark, which indicated the start of
the next instruction.
The word mark explicitly indicated to the processor how long each instruction was.
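The word-mark scheme can be illustrated with a short sketch that splits a stream of characters wherever a word mark appears. The memory representation (a list of character and flag pairs) and the placeholder characters are assumptions for illustration, not real 1401 instructions.

```python
def split_instructions(memory):
    """Split (char, word_mark) pairs into instructions; a word mark starts a new one."""
    instructions = []
    current = []
    for char, word_mark in memory:
        if word_mark and current:
            # A word mark on this character ends the previous instruction.
            instructions.append("".join(current))
            current = []
        current.append(char)
    if current:
        instructions.append("".join(current))
    return instructions


# Placeholder characters; True marks the first character of each instruction.
memory = [("A", True), ("1", False), ("2", False),
          ("B", True),
          ("C", True), ("9", False)]
print(split_instructions(memory))
```

Unlike the 8086, this boundary information is stored with the data itself, so the instruction length is explicit before any decoding happens.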
Perhaps the worst approach for variable-length instructions was the Intel iAPX 432 processor (1981), which had instructions
with variable bit lengths, from 6 to 321 bits long.
As a result, instructions weren’t aligned on byte boundaries, making instruction decoding even more inconvenient.
This was just one of the reasons that the iAPX 432 ended up overly complicated, years behind schedule, and a commercial failure.
The 8086’s variable-length instructions led to the x86 architecture, with instructions from 1 to 15 bytes long.
This is particularly inconvenient with modern superscalar processors that run multiple instructions in parallel.
The problem is that the processor must break the instruction stream into individual instructions before they execute.
The Intel P6 microarchitecture used in the Pentium Pro (1995) has instruction decoders to decode the instruction stream into micro-operations.4
It starts with an “instruction length block” that analyzes the first bytes of the instruction to determine how long it is.
(This is not a straightforward task to perform rapidly on multiple instructions in parallel.)
The “instruction steering block” uses this information to break the byte stream into instructions and steer instructions to
the decoders.
The AMD K6 3D processor (1999) had predecode logic that associated 5 predecode bits with each instruction byte: three pointed to
the start of the next instruction, one indicated that the length depended on a D bit, and one indicated the presence of a ModR/M byte.
This logic examined up to three bytes to make its decisions.
Instructions were split apart and assigned to decoders based on the predecode bits.
In some cases, the predecode logic gave up and flagged the instruction as “unsuccessfully predecoded”, for instance an instruction longer than 7 bytes.
These instructions were handled by a slower path.
Conclusions
The 8086 processor has instructions with a variety of lengths, but nothing in the processor explicitly determines the length.
Instead, an instruction uses as many bytes as it needs.
(That sounds kind of tautological, but I’m not sure how else to put it.)
The Group Decode ROM makes an initial classification, the Translation ROM determines the addressing mode, and the microcode
consumes bytes as needed.
While this approach gave the 8086 a flexible instruction set, it created a problem in the long run for the x86 architecture,
requiring complicated logic to determine instruction length.
One benefit of RISC-based processors such as the Apple M1 is that they have (mostly) constant instruction lengths, making instruction
decoding faster and simpler.
I’ve written multiple posts on the 8086 so far and
plan to continue reverse-engineering the 8086 die, so
follow me on Twitter @kenshirriff or RSS for updates.
I’ve also started experimenting with Mastodon recently as @[email protected].