Reverse-engineering the division microcode within the Intel 8086 processor
Whereas programmers as we speak take division with no consideration, most microprocessors within the Nineteen Seventies may solely add and subtract — division required a gradual and tedious loop carried out in meeting code.
One of many good options of the Intel 8086 processor (1978) was
that it offered machine directions for integer multiplication and division.
Internally, the 8086 nonetheless carried out a loop, however the loop was carried out in microcode: quicker and clear to
the programmer.
Even so, division was a gradual operation, about 50 occasions slower than addition.
I just lately examined multiplication in the 8086, and
now it is time to take a look at the division microcode.1
(There’s numerous overlap with the multiplication publish so apologies for any deja vu.)
The die photograph under reveals the chip underneath a microscope.
I’ve labeled the important thing practical blocks; those which might be essential to this publish are darker.
On the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations on the coronary heart of division: subtraction and shifts.
Division additionally makes use of a number of particular {hardware} options: the X register, the F1 flag, and a loop counter.
The microcode ROM on the decrease proper controls the method.
The 8086 die underneath a microscope, with predominant practical blocks labeled. This photograph reveals the chip with the metallic and polysilicon eliminated, revealing the silicon beneath. Click on on this picture (or some other) for a bigger model.
Microcode
Like most directions, the division routines within the 8086 are carried out in microcode.
Most individuals consider machine directions as the essential steps that a pc performs.
Nevertheless, many processors have one other layer of software program beneath: microcode.
With microcode, as an alternative of constructing the CPU’s management circuitry from advanced logic gates, the management logic is essentially changed with code.
To execute a machine instruction, the pc internally executes a number of easier micro-instructions, specified by the microcode.
That is particularly helpful for a machine instruction corresponding to division, which performs many steps in a loop.
Every micro-instruction within the 8086 is encoded into 21 bits as proven under.
Each micro-instruction strikes information from a supply register to a vacation spot register, every specified with 5 bits.
The that means of the remaining bits is dependent upon the sort area and could be something from an ALU operation to a reminiscence learn or write to
a change of microcode management movement.
Thus, an 8086 micro-instruction sometimes does two issues in parallel: the transfer and the motion.
For extra about 8086 microcode, see my microcode blog post.
Just a few particulars of the ALU (Arithmetic/Logic Unit) operations are obligatory to know the division microcode.
The ALU has three momentary registers which might be invisible to the programmer: tmpA, tmpB, and tmpC.
An ALU operation takes its first argument from the required momentary register, whereas the second argument at all times comes from tmpB.
An ALU operation requires two micro-instructions.
The primary micro-instruction specifies the ALU operation and supply register, configuring the ALU. As an example, ADD tmpA
so as to add tmpA to the default tmpB.
Within the subsequent micro-instruction (or a later one), the ALU outcome could be accessed by way of a register known as Σ
(Sigma) and moved to a different register.
The carry flag performs a key position in division.
The carry flag is without doubt one of the programmer-visible standing flags that’s set by arithmetic operations, however additionally it is used
by the microcode.
For unsigned addition, the carry flag is about if there’s a perform of the phrase (or byte).
For subtraction, the carry flag signifies a borrow, and is about if the subtraction requires a borrow.
Since a borrow outcomes if you happen to subtract a bigger quantity from a smaller quantity, the carry flag additionally signifies
the “lower than” situation.
The carry flag (and different standing flags) are solely up to date if micro-instruction accommodates the F
bit.
The RCL
(Rotate by way of Carry, Left) micro-instruction is closely used within the division microcode.2
This operation shifts the bits in a 16-bit phrase, much like the <<
bit-shift operation in high-level languages, however with an extra characteristic.
As an alternative of discarding the bit on the top, that bit is moved into the carry flag. In the meantime, the bit previously within the carry flag strikes into the phrase.
You’ll be able to consider this as rotating the bits whereas treating the carry flag as a seventeenth little bit of the phrase.
(The RCL
operation may also act on a byte.)
The rotate by way of carry left micro-instruction.
These shifts carry out an essential a part of the division course of since shifting could be seen as multiplying or dividing by two.
RCL
additionally offers a handy approach to transfer the most-significant bit to the carry flag, the place it may be examined for a conditional leap.
(That is essential as a result of the highest bit is used because the signal bit.)
One other essential property is that performing RCL
on a decrease phrase after which RCL
on an higher phrase will carry out a 32-bit shift, since
the excessive little bit of the decrease phrase might be moved into the low little bit of the higher phrase by way of the carry bit.
Lastly, the shift strikes the quotient bit from the carry into the register.
Binary division
The division course of within the 8086 is much like grade-school lengthy division, besides in binary as an alternative of decimal.
The diagram under reveals the method: dividing 67 (the dividend) by 9 (the divisor) yields the quotient 7 on the prime and the rest 4 on the backside.
Lengthy division is less complicated in binary than decimal since you needn’t guess the suitable quotient digit.
As an alternative, at every step you both subtract the divisor (appropriately shifted)
or subtract nothing.
Observe that though the divisor is 4 bits on this instance, the subtractions use 5-bit values.
The necessity for an “further” bit in division might be essential within the dialogue under; 16-bit division wants a 17-bit worth.
0 | 1 | 1 | 1 | ||||||||
1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
– | 0 | 0 | 0 | 0 | |||||||
1 | 0 | 0 | 0 | 0 | |||||||
– | 1 | 0 | 0 | 1 | |||||||
0 | 1 | 1 | 1 | 1 | |||||||
– | 1 | 0 | 0 | 1 | |||||||
0 | 1 | 1 | 0 | 1 | |||||||
– | 1 | 0 | 0 | 1 | |||||||
0 | 1 | 0 | 0 |
As an alternative of shifting the divisor to the suitable every step, the 8086’s algorithm shifts the quotient and the present dividend
to the left every step.
This trick reduces the register house required.
Dividing a 32-bit quantity (the dividend) by a 16-bit quantity yields a 16-bit outcome, so it looks as if you’d want 4 16-bit registers in
whole.
The trick is that after every step, the 32-bit dividend will get one bit smaller, whereas the outcome will get one bit bigger.
Thus, the dividend and the outcome could be packed collectively into 32 bits. On the finish, what’s left of the dividend is
the 16-bit the rest. The desk under illustrates this course of for a pattern dividend (blue) and quotient (inexperienced).3
On the finish, the 16-bit blue worth is the rest.
dividend | quotient | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
The division microcode
The 8086 has 4 division directions to deal with signed and unsigned division of byte and phrase operands.
I am going to begin by describing the microcode for the unsigned phrase division instruction DIV
, which divides a 32-bit dividend by a 16-bit divisor.
The dividend is provided within the AX and DX registers whereas the divisor is specified by the supply operand.
The 16-bit quotient is returned in AX and the 16-bit the rest in DX.
For a divide-by-zero, or if the quotient is bigger than 16 bits, a kind 0 “divide error” interrupt is generated.
CORD
: The core division routine
The CORD
microcode subroutine under implements the long-division algorithm for all division directions; I feel CORD
stands for Core Divide.
At entry, the arguments are within the ALU momentary registers:
tmpA/tmpC maintain the 32-bit dividend, whereas tmpB holds the 16-bit divisor.
(I am going to clarify the configuration for byte division later.)
Every cycle of the loop shifts the values after which probably subtracts the divisor.
One bit is appended to the quotient to point whether or not the
divisor was subtracted or not.
On the finish of the loop, no matter is left of the dividend is the rest.
Every micro-instruction specifies a register transfer on the left, and an motion on the suitable.
The strikes switch phrases between the seen registers and the ALU’s momentary registers, whereas the actions are largely ALU operations or management movement.
As is often the case with microcode, the main points are tough.
The primary three strains under verify if the division will overflow or divide by zero.
The code compares tmpA and tmpB by subtracting tmpB, discarding the outcome, however setting the standing flags (F
).
If the higher phrase of the divisor is larger or equal to the dividend, the division will overflow, so execution jumps to INT0
to generate
a divide-by-zero interrupt.4 (This handles each the case the place the dividend is “too massive” and the divide-by-0 case.)
The variety of loops within the division algorithm is managed by a special-purpose loop counter.
The MAXC
micro-instruction initializes the counter to 7 or 15, for a byte or phrase divide instruction respectively.
transfer motion SUBT tmpA CORD: arrange evaluate Σ → no dest MAXC F evaluate, arrange counter, replace flags JMP NCY INT0 generate interrupt if overflow RCL tmpC 3: predominant loop: left shift tmpA/tmpC Σ → tmpC RCL tmpA Σ → tmpA SUBT tmpA arrange evaluate/subtract JMPS CY 13 leap if prime little bit of tmpA was set Σ → no dest F evaluate, replace flags JMPS NCY 14 leap for subtract JMPS NCZ 3 check counter, loop again to three RCL tmpC 10: completed: Σ → tmpC shift final bit into tmpC Σ → no dest RTN completed: get prime bit, return RCY 13: reset carry Σ → tmpA JMPS NCZ 3 14: subtract, loop JMPS 10 completed, goto 10
The primary loop begins at 3.
The tmpC and tmpA registers are shifted left. This has two essential unwanted effects. First, the previous carry bit (which holds the most recent
quotient bit) is shifted into the underside of tmpC. Second, the highest little bit of tmpA is shifted into the carry bit;
this offers the required “further” bit for the subtraction under.
Particularly, if the carry (the “further” tmpA bit) is about, tmpB could be subtracted, which is completed by leaping to 13.
In any other case, the code compares tmpA and tmpB by
subtracting tmpB, discarding the outcome, and updating the flags (F
).
If there was no borrow/carry (tmpA ≥ tmpB), execution jumps to 14 to subtract.
In any other case, the inner loop counter is decremented and management movement goes again to the highest of the loop if not completed
(NCZ
, Not Counter Zero).
If the loop is finished, tmpC is rotated left to choose up the final quotient bit from the carry flag.
Then a second rotate of tmpC is carried out however the result’s discarded; this places the highest little bit of tmpC into the carry flag for
use later in POSTIDIV
. Lastly, the subroutine returns.
The subtraction path is 13 and 14, which subtract tmpB from tmpA by storing the outcome (Σ) in tmpA.
This path resets the carry flag to be used because the quotient bit.
As within the different path, the loop counter is decremented and examined (NCZ
) and execution both continues again at 3
or finishes at 10.
One complication is that the carry bit is the other of the specified quotient bit.
Particularly, if tmpA < tmpB, the comparability generates a borrow so the carry flag is about to 1.
On this case, the specified quotient bit is 0 and no subtraction takes place.
But when tmpA ≥ tmpB, the comparability doesn’t generate a borrow (so the carry flag is about to 0), the code subtracts tmpB,
and the specified quotient bit is 1.
Thus, tmpC finally ends up holding the complement of the specified outcome; that is fastened later.
The microcode is rigorously designed to pack the divide loop right into a small variety of micro-instructions.
It makes use of the registers and the carry flag in tough methods, utilizing the carry flag to carry the highest little bit of tmpA,
the comparability outcome, and the generated quotient bit.
This makes the code impressively dense however tough to know.
The highest-level division microcode
Now I am going to pop up a degree and check out the top-level microcode (under) that implements the DIV
and IDIV
machine directions.
The primary three directions load tmpA, tmpC, and tmpB from the required registers.
(The M
register refers back to the supply specified within the instruction, both a register or a reminiscence location.)
Subsequent, the X0
situation checks bit 3 of the instruction, which on this case distinguishes DIV
from IDIV
.
For signed division (IDIV
), the microcode calls PREIDIV
, which I am going to talk about under.
Subsequent, the CORD
micro-subroutine mentioned above known as to carry out the division.
DX → tmpA iDIV rmw: load tmpA, tmpC, tmpB AX → tmpC RCL tmpA arrange RCL left shift operation M → tmpB CALL X0 PREIDIV arrange integer division if IDIV CALL CORD name CORD to carry out division COM1 tmpC set as much as complement the quotient DX → tmpB CALL X0 POSTIDIV get authentic dividend, deal with IDIV Σ → AX NXT retailer up to date quotient tmpA → DX RNI retailer the rest, run subsequent instruction
As mentioned above, the quotient in tmpC must be 1’s-complemented, which is finished with COM1
.
For IDIV
, the micro-subroutine POSTIDIV
units the indicators of the outcomes appropriately.
The outcomes are saved within the AX
and DX
registers.
The NXT
micro-operation signifies the subsequent micro-instruction is the final one, directing the microcode engine to
begin the subsequent
machine instruction. Lastly, RNI
directs the microcode engine to run the subsequent machine instruction.
8-bit division
The 8086 has separate opcodes for 8-bit division.
The 8086 helps many directions with byte and phrase variations, utilizing 8-bit or 16-bit arguments respectively.
Normally, the byte and phrase directions use the identical microcode, with the ALU and register {hardware} utilizing bytes or phrases based mostly on the instruction.
Within the case of division,
the shift micro-operations act on tmpA and tmpC as 8-bit registers moderately than 16-bit registers.
Furthermore, the MAXC
micro-operation initializes the inner counter to 7 moderately than 15.
Thus, the identical CORD
microcode is used for phrase and byte division, however the variety of loops and the precise
operations are modified by the {hardware}.
The diagram under reveals the tmpA and tmpC registers throughout every step of dividing 0x2345 by 0x34.
Observe that the registers are handled as 8-bit registers.
The divisor (blue) steadily shrinks with the quotient (inexperienced) taking its place.
On the finish, the rest is 0x41 (blue) and the quotient is 0xad, complement of the inexperienced worth.
tmpA | tmpC | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
Though the CORD
routine is shared for byte and phrase division, the top-level microcode is totally different.
Particularly, the byte and phrase division directions use totally different registers, requiring microcode modifications.
The microcode under is the top-level code for byte division. It’s nearly the identical because the microcode above, besides it
makes use of the highest and backside bytes of the accumulator (AH
and AL
) moderately than the AX
and DX
registers.
AH → tmpA iDIV rmb: get arguments AL → tmpC RCL tmpA arrange RCL left shift operation M → tmpB CALL X0 PREIDIV deal with signed division if IDIV CALL CORD name CORD to carry out division COM1 tmpC complement the quotient AH → tmpB CALL X0 POSTIDIV deal with signed division if IDIV Σ → AL NXT retailer quotient tmpA → AH RNI retailer the rest, run subsequent instruction
Signed division
The 8086 (like most computer systems) represents signed numbers utilizing a format known as two’s complement.
Whereas a daily byte holds a quantity from 0 to 255, a signed byte holds a quantity from -128 to 127.
A detrimental quantity is shaped by flipping all of the bits (often known as the one’s complement) after which including 1, yielding the 2’s complement worth.
As an example, +5 is 0x05
whereas -5 is 0xfb
.
(Observe that the highest little bit of a quantity is about for a detrimental quantity; that is the signal bit.)
The good factor about two’s complement numbers is that the identical addition and subtraction operations work on each signed and unsigned values.
Sadly, this isn’t the case for signed multiplication and division.
The 8086 has separate IDIV
(Integer Divide) directions to carry out signed division.
The 8086 performs signed division by changing the arguments to constructive values, performing unsigned division, after which
negating the outcomes if obligatory.
As proven earlier, signed and unsigned division each use the identical top-level microcode and the microcode conditionally calls some subroutines for
signed division.
These further subroutines trigger a major efficiency penalty, making signed division over 20 cycles slower than unsigned division.
I’ll talk about these micro-subroutines under.
PREIDIV
The primary subroutine for signed division is PREIDIV
, performing preliminary operations for integer division.
It converts the 2 arguments, saved in tmpA/tmpC and tmpB, to constructive values.
It retains monitor of the indicators utilizing an inside flag known as F1
, toggling this flag for every detrimental argument.
This conveniently handles the rule that two negatives make a constructive since complementing the F1
flag twice will clear it.
The purpose of that is that the primary division code (CORD
) solely must deal with unsigned division.
The microcode under implements PREIDIV
.
First it checks if tmpA is detrimental, however
the 8086 doesn’t have a microcode situation to straight check the signal of a price.
As an alternative, the microcode determines if a price is detrimental by shifting the worth left, which strikes the highest (signal) bit into the carry flag.
The conditional leap (NCY
) then checks if the carry is obvious, leaping if the worth is non-negative.
If tmpA is detrimental, execution falls by way of to negate the primary argument.
That is tough as a result of the argument is cut up between the tmpA and tmpC registers.
The 2’s complement operation (NEG
) is utilized to the low phrase, whereas both 2’s complement or one’s complement (COM1
) is utilized to
the higher phrase, relying on the carry for mathematical causes.5
The F1
flag is complemented to maintain monitor of the signal.
(The multiplication course of reuses most of this code, beginning on the NEGATE
entry level.)
Σ → no dest PREIDIV: shift tmpA left JMPS NCY 7 leap if non-negative NEG tmpC NEGATE: negate tmpC Σ → tmpC COM1 tmpA F possibly complement tmpA JMPS CY 6 NEG tmpA negate tmpA if there is not any carry Σ → tmpA CF1 6: toggle F1 (signal) RCL tmpB 7: check signal of tmpB Σ → no dest NEG tmpB possibly negate tmpB JMPS NCY 11 skip if tmpB constructive Σ → tmpB CF1 RTN else negate tmpB, toggle F1 (signal) RTN 11: return
The following a part of the code, beginning at 7, negates tmpB (the divisor) whether it is detrimental. Because the divisor is a single
phrase, this code is easier.
As earlier than, the F1
flag is toggled if tmpB is detrimental.
On the finish, each arguments (tmpA/tmpC and tmpB) are constructive, and F1
signifies the signal of the outcome.
POSTIDIV
After computing the outcome, the POSTIDIV
routine known as for signed division.
The routine first checks for a signed overflow and raises a divide-by-zero interrupt if that’s the case.
Subsequent, the routine negates the quotient and the rest if obligatory.6
In additional element, the CORD
routine left the highest little bit of tmpC (the complemented quotient) within the carry flag.
Now, that bit is examined. If the carry bit is 0 (NCY
), then the highest little bit of the quotient is 1 so the quotient is just too huge to slot in a signed worth.7
On this case, the INT0
routine is executed to set off a kind 0 interrupt, indicating a divide overflow.
(It is a moderately roundabout method of testing the quotient, counting on a carry bit that was set in a earlier subroutine.)
JMP NCY INT0 POSTIDIV: if overflow, set off interrupt RCL tmpB arrange rotate of tmpB Σ → no dest NEG tmpA get signal of tmpB, arrange negate of tmpA JMPS NCY 5 skip if tmpB non-negative Σ → tmpA in any other case negate tmpA (the rest) INC tmpC 5: arrange increment JMPS F1 8 check signal flag, skip if set COM1 tmpC in any other case arrange complement CCOF RTN 8: clear carry and overflow flags, return
Subsequent, tmpB (the divisor) is rotated to see whether it is detrimental.
(The caller loaded tmpB with the unique divisor, changing the dividend that was in tmpB beforehand.)
If the divisor is detrimental, tmpA (the rest) is negated.
This implements the 8086 rule that the signal of the rest matches the signal of the divisor.
The quotient dealing with is a bit tough. Recall that tmpC holds the complemented quotient.
the F1
flag is about if the outcome must be detrimental. In that case, the complemented quotient must be incremented
by 1 (INC
) to transform from 1’s complement to 2’s complement.
However, if the quotient must be constructive, 1’s-complementing tmpC (COM1
) will yield the specified constructive
quotient.
In both case, the ALU is configured in POSTIDIV
, however the outcome might be saved again in the primary routine.
Lastly, the CCOF
micro-operation clears the carry and overflow flags.
Curiously, the 8086 documentation declares that the standing flags are undefined following IDIV
, however the microcode
explicitly clears the carry and overflow flags.
I assume that the flags had been cleared in analogy with MUL
, however then Intel determined that this wasn’t helpful so that they
did not doc it. (Documenting this characteristic would obligate them to supply the identical performance in later x86 chips.)
The {hardware} for division
For probably the most half, the 8086 makes use of the common ALU addition and shifts for the division algorithm. Some particular {hardware}
options present help.
On this part, I am going to take a look at this {hardware}.
Loop counter
The 8086 has a 4-bit loop counter for multiplication and division. This counter begins at 7 for byte division and 15 for phrase division,
based mostly on the low little bit of the opcode.
This loop counter permits the microcode to decrement the counter, check for the top, and carry out a conditional department in a single micro-operation.
The counter is carried out with 4 flip-flops, together with logic to compute the worth after decrementing by one.
The MAXC
(Most Depend) micro-instruction units the counter to 7 or 15 for byte or phrase operations respectively.
The NCZ
(Not Counter Zero) micro-instruction has two actions. First, it performs a conditional leap if the counter is nonzero.
Second, it decrements the counter.
The F1 flag
Signed multiplication and division use an inside flag known as F1
8 to maintain monitor of the signal.
The F1
flag is toggled by microcode by way of the CF1
(Complement F1) micro-instruction.
The F1
flag is carried out with a flip-flop, together with a multiplexer to pick out the worth. It’s cleared when a brand new instruction begins,
set by a REP
prefix, and toggled by the CF1
micro-instruction.
The diagram under reveals how the F1 latch and the loop counter seem on the die. On this picture, the metallic layer has been eliminated, exhibiting the
silicon and the polysilicon wiring beneath.
The counter and F1 latch as they seem on the die. The latch for the REP state can also be right here.
X register
The division microcode makes use of an inside register known as the X
register to differentiate between the DIV
and IDIV
directions.
The X
register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.9
Because the instruction is held within the Instruction Register, you may surprise why a separate register is required.
The motivation is that some opcodes specify the kind of ALU operation within the second byte of the instruction, the ModR/M byte, bits 5–3.10
Because the ALU operation is typically specified within the first byte and typically within the second byte, the X
register was added to deal with
each these circumstances.
For probably the most half, the X
register signifies which of the eight normal ALU operations is chosen (ADD
, OR
, ADC
, SBB
, AND
, SUB
, XOR
, CMP
).
Nevertheless, a number of directions use bit 0 of the X
register to differentiate between different pairs of directions.
As an example, it distinguishes between MUL
and IMUL
, DIV
and IDIV
, CMPS
and SCAS
, MOVS
and LODS
, or AAA
and AAS
.
Whereas these instruction pairs might seem to have arbitrary opcodes, they’ve been rigorously assigned
so the microcode can distinguish them.
The implementation of the X
register is simple, consisting of three flip-flops to carry the three bits of the instruction.
The flip-flops are loaded from the prefetch queue bus throughout First Clock and through Second Clock for acceptable directions, because the
instruction bytes journey over the bus.
Testing bit 0 of the X
register with the X0
situation is supported by the microcode situation analysis circuitry, so it may be used for conditional jumps within the microcode.
Algorithmic and historic context
As you may see from the microcode, division is a sophisticated and comparatively gradual course of.
On the 8086, division takes as much as 184 clock cycles to carry out all of the microcode steps.
(Compared, two registers could be added in 3 clock cycles.)
Multiplication and division each loop over the bits, performing repeated addition or subtraction respectively.
However division requires a choice (subtract or not?) at every step, making it even slower, about half the pace of multiplication.11
Numerous algorithms have been developed to hurry up division.
Moderately than performing lengthy division one bit at a time, you are able to do lengthy division in, say, base 4, producing two quotient
bits in every step.
As with decimal lengthy division, the tough half is determining what digit to pick out. The SRT algorithm (1957) makes use of a small
lookup desk to estimate the quotient digit from a number of bits of the divisor and dividend.
The intelligent half is that the chosen digit would not should be precisely proper at every step; the algorithm will self-correct
after a flawed “guess”.
The Pentium processor (1993) famously had a floating point division bug
due to some lacking values within the SRT desk. This bug value Intel $475 million to switch the defective processors.
Intel’s x86 processors steadily improved divide efficiency. The 80286 (1982) carried out a phrase divide in 22 clocks, about
6 occasions as quick because the 8086.
Within the Penryn structure (2007), Intel upgraded from
Radix-4 to Radix-16 division.
Moderately than having separate integer and floating-point {hardware}, integer divides had been dealt with by way of the floating level divider.
Though trendy Intel processors have enormously improved multiplication and division in comparison with the 8086, division remains to be a comparatively gradual operation.
Whereas a Tiger Lake (2020) processor can carry out an integer multiplication each clock cycle (with a latency of three cycles),
division is way slower and may solely be completed as soon as each 6-10 clock cycles (details).
I’ve written quite a few posts on the 8086 up to now and
plan to proceed reverse-engineering the 8086 die so
observe me on Twitter @kenshirriff or RSS for updates.
I’ve additionally began experimenting with Mastodon just lately as @[email protected].