# Reverse-engineering the division microcode within the Intel 8086 processor

*by*Phil Tadros

Whereas programmers as we speak take division with no consideration, most microprocessors within the Nineteen Seventies may solely add and subtract — division required a gradual and tedious loop carried out in meeting code.

One of many good options of the Intel 8086 processor (1978) was

that it offered machine directions for integer multiplication and division.

Internally, the 8086 nonetheless carried out a loop, however the loop was carried out in microcode: quicker and clear to

the programmer.

Even so, division was a gradual operation, about 50 occasions slower than addition.

I just lately examined multiplication in the 8086, and

now it is time to take a look at the division microcode.1

(There’s numerous overlap with the multiplication publish so apologies for any deja vu.)

The die photograph under reveals the chip underneath a microscope.

I’ve labeled the important thing practical blocks; those which might be essential to this publish are darker.

On the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations on the coronary heart of division: subtraction and shifts.

Division additionally makes use of a number of particular {hardware} options: the X register, the F1 flag, and a loop counter.

The microcode ROM on the decrease proper controls the method.

The 8086 die underneath a microscope, with predominant practical blocks labeled. This photograph reveals the chip with the metallic and polysilicon eliminated, revealing the silicon beneath. Click on on this picture (or some other) for a bigger model.

## Microcode

Like most directions, the division routines within the 8086 are carried out in microcode.

Most individuals consider machine directions as the essential steps that a pc performs.

Nevertheless, many processors have one other layer of software program beneath: microcode.

With microcode, as an alternative of constructing the CPU’s management circuitry from advanced logic gates, the management logic is essentially changed with code.

To execute a machine instruction, the pc internally executes a number of easier micro-instructions, specified by the microcode.

That is particularly helpful for a machine instruction corresponding to division, which performs many steps in a loop.

Every micro-instruction within the 8086 is encoded into 21 bits as proven under.

Each micro-instruction strikes information from a supply register to a vacation spot register, every specified with 5 bits.

The that means of the remaining bits is dependent upon the sort area and could be something from an ALU operation to a reminiscence learn or write to

a change of microcode management movement.

Thus, an 8086 micro-instruction sometimes does two issues in parallel: the transfer and the motion.

For extra about 8086 microcode, see my microcode blog post.

Just a few particulars of the ALU (Arithmetic/Logic Unit) operations are obligatory to know the division microcode.

The ALU has three momentary registers which might be invisible to the programmer: tmpA, tmpB, and tmpC.

An ALU operation takes its first argument from the required momentary register, whereas the second argument at all times comes from tmpB.

An ALU operation requires two micro-instructions.

The primary micro-instruction specifies the ALU operation and supply register, configuring the ALU. As an example, `ADD tmpA`

so as to add tmpA to the default tmpB.

Within the subsequent micro-instruction (or a later one), the ALU outcome could be accessed by way of a register known as `Σ`

(Sigma) and moved to a different register.

The carry flag performs a key position in division.

The carry flag is without doubt one of the programmer-visible standing flags that’s set by arithmetic operations, however additionally it is used

by the microcode.

For unsigned addition, the carry flag is about if there’s a perform of the phrase (or byte).

For subtraction, the carry flag signifies a borrow, and is about if the subtraction requires a borrow.

Since a borrow outcomes if you happen to subtract a bigger quantity from a smaller quantity, the carry flag additionally signifies

the “lower than” situation.

The carry flag (and different standing flags) are solely up to date if micro-instruction accommodates the `F`

bit.

The `RCL`

(Rotate by way of Carry, Left) micro-instruction is closely used within the division microcode.2

This operation shifts the bits in a 16-bit phrase, much like the `<<`

bit-shift operation in high-level languages, however with an extra characteristic.

As an alternative of discarding the bit on the top, that bit is moved into the carry flag. In the meantime, the bit previously within the carry flag strikes into the phrase.

You’ll be able to consider this as rotating the bits whereas treating the carry flag as a seventeenth little bit of the phrase.

(The `RCL`

operation may also act on a byte.)

The rotate by way of carry left micro-instruction.

These shifts carry out an essential a part of the division course of since shifting could be seen as multiplying or dividing by two.

`RCL`

additionally offers a handy approach to transfer the most-significant bit to the carry flag, the place it may be examined for a conditional leap.

(That is essential as a result of the highest bit is used because the signal bit.)

One other essential property is that performing `RCL`

on a decrease phrase after which `RCL`

on an higher phrase will carry out a 32-bit shift, since

the excessive little bit of the decrease phrase might be moved into the low little bit of the higher phrase by way of the carry bit.

Lastly, the shift strikes the quotient bit from the carry into the register.

## Binary division

The division course of within the 8086 is much like grade-school lengthy division, besides in binary as an alternative of decimal.

The diagram under reveals the method: dividing 67 (the dividend) by 9 (the divisor) yields the quotient 7 on the prime and the rest 4 on the backside.

Lengthy division is less complicated in binary than decimal since you needn’t guess the suitable quotient digit.

As an alternative, at every step you both subtract the divisor (appropriately shifted)

or subtract nothing.

Observe that though the divisor is 4 bits on this instance, the subtractions use 5-bit values.

The necessity for an “further” bit in division might be essential within the dialogue under; 16-bit division wants a 17-bit worth.

0 | 1 | 1 | 1 | ||||||||

1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |

– | 0 | 0 | 0 | 0 | |||||||

1 | 0 | 0 | 0 | 0 | |||||||

– | 1 | 0 | 0 | 1 | |||||||

0 | 1 | 1 | 1 | 1 | |||||||

– | 1 | 0 | 0 | 1 | |||||||

0 | 1 | 1 | 0 | 1 | |||||||

– | 1 | 0 | 0 | 1 | |||||||

0 | 1 | 0 | 0 |

As an alternative of shifting the divisor to the suitable every step, the 8086’s algorithm shifts the quotient and the present dividend

to the left every step.

This trick reduces the register house required.

Dividing a 32-bit quantity (the dividend) by a 16-bit quantity yields a 16-bit outcome, so it looks as if you’d want 4 16-bit registers in

whole.

The trick is that after every step, the 32-bit dividend will get one bit smaller, whereas the outcome will get one bit bigger.

Thus, the dividend and the outcome could be packed collectively into 32 bits. On the finish, what’s left of the dividend is

the 16-bit the rest. The desk under illustrates this course of for a pattern dividend (blue) and quotient (inexperienced).3

On the finish, the 16-bit blue worth is the rest.

dividend | quotient | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |

0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |

0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |

## The division microcode

The 8086 has 4 division directions to deal with signed and unsigned division of byte and phrase operands.

I am going to begin by describing the microcode for the unsigned phrase division instruction `DIV`

, which divides a 32-bit dividend by a 16-bit divisor.

The dividend is provided within the AX and DX registers whereas the divisor is specified by the supply operand.

The 16-bit quotient is returned in AX and the 16-bit the rest in DX.

For a divide-by-zero, or if the quotient is bigger than 16 bits, a kind 0 “divide error” interrupt is generated.

`CORD`

: The core division routine

The `CORD`

microcode subroutine under implements the long-division algorithm for all division directions; I feel `CORD`

stands for Core Divide.

At entry, the arguments are within the ALU momentary registers:

tmpA/tmpC maintain the 32-bit dividend, whereas tmpB holds the 16-bit divisor.

(I am going to clarify the configuration for byte division later.)

Every cycle of the loop shifts the values after which probably subtracts the divisor.

One bit is appended to the quotient to point whether or not the

divisor was subtracted or not.

On the finish of the loop, no matter is left of the dividend is the rest.

Every micro-instruction specifies a register transfer on the left, and an motion on the suitable.

The strikes switch phrases between the seen registers and the ALU’s momentary registers, whereas the actions are largely ALU operations or management movement.

As is often the case with microcode, the main points are tough.

The primary three strains under verify if the division will overflow or divide by zero.

The code compares tmpA and tmpB by subtracting tmpB, discarding the outcome, however setting the standing flags (`F`

).

If the higher phrase of the divisor is larger or equal to the dividend, the division will overflow, so execution jumps to `INT0`

to generate

a divide-by-zero interrupt.4 (This handles each the case the place the dividend is “too massive” and the divide-by-0 case.)

The variety of loops within the division algorithm is managed by a special-purpose loop counter.

The `MAXC`

micro-instruction initializes the counter to 7 or 15, for a byte or phrase divide instruction respectively.

transfer motion SUBT tmpACORD:arrange evaluate Σ → no dest MAXC F evaluate, arrange counter, replace flags JMP NCY INT0 generate interrupt if overflow RCL tmpC3:predominant loop: left shift tmpA/tmpC Σ → tmpC RCL tmpA Σ → tmpA SUBT tmpA arrange evaluate/subtract JMPS CY 13 leap if prime little bit of tmpA was set Σ → no dest F evaluate, replace flags JMPS NCY 14 leap for subtract JMPS NCZ 3 check counter, loop again to three RCL tmpC10:completed: Σ → tmpC shift final bit into tmpC Σ → no dest RTN completed: get prime bit, return RCY13:reset carry Σ → tmpA JMPS NCZ 314:subtract, loop JMPS 10 completed, goto 10

The primary loop begins at *3*.

The tmpC and tmpA registers are shifted left. This has two essential unwanted effects. First, the previous carry bit (which holds the most recent

quotient bit) is shifted into the underside of tmpC. Second, the highest little bit of tmpA is shifted into the carry bit;

this offers the required “further” bit for the subtraction under.

Particularly, if the carry (the “further” tmpA bit) is about, tmpB could be subtracted, which is completed by leaping to *13*.

In any other case, the code compares tmpA and tmpB by

subtracting tmpB, discarding the outcome, and updating the flags (`F`

).

If there was no borrow/carry (tmpA ≥ tmpB), execution jumps to *14* to subtract.

In any other case, the inner loop counter is decremented and management movement goes again to the highest of the loop if not completed

(`NCZ`

, Not Counter Zero).

If the loop is finished, tmpC is rotated left to choose up the final quotient bit from the carry flag.

Then a second rotate of tmpC is carried out however the result’s discarded; this places the highest little bit of tmpC into the carry flag for

use later in `POSTIDIV`

. Lastly, the subroutine returns.

The subtraction path is *13* and *14*, which subtract tmpB from tmpA by storing the outcome (Σ) in tmpA.

This path resets the carry flag to be used because the quotient bit.

As within the different path, the loop counter is decremented and examined (`NCZ`

) and execution both continues again at *3*

or finishes at *10*.

One complication is that the carry bit is the other of the specified quotient bit.

Particularly, if tmpA < tmpB, the comparability generates a borrow so the carry flag is about to 1.

On this case, the specified quotient bit is 0 and no subtraction takes place.

But when tmpA ≥ tmpB, the comparability doesn’t generate a borrow (so the carry flag is about to 0), the code subtracts tmpB,

and the specified quotient bit is 1.

Thus, tmpC finally ends up holding the *complement* of the specified outcome; that is fastened later.

The microcode is rigorously designed to pack the divide loop right into a small variety of micro-instructions.

It makes use of the registers and the carry flag in tough methods, utilizing the carry flag to carry the highest little bit of tmpA,

the comparability outcome, and the generated quotient bit.

This makes the code impressively dense however tough to know.

### The highest-level division microcode

Now I am going to pop up a degree and check out the top-level microcode (under) that implements the `DIV`

and `IDIV`

machine directions.

The primary three directions load tmpA, tmpC, and tmpB from the required registers.

(The `M`

register refers back to the supply specified within the instruction, both a register or a reminiscence location.)

Subsequent, the `X0`

situation checks bit 3 of the instruction, which on this case distinguishes `DIV`

from `IDIV`

.

For signed division (`IDIV`

), the microcode calls `PREIDIV`

, which I am going to talk about under.

Subsequent, the `CORD`

micro-subroutine mentioned above known as to carry out the division.

DX → tmpAiDIV rmw:load tmpA, tmpC, tmpB AX → tmpC RCL tmpA arrange RCL left shift operation M → tmpB CALL X0 PREIDIV arrange integer division if IDIV CALL CORD name CORD to carry out division COM1 tmpC set as much as complement the quotient DX → tmpB CALL X0 POSTIDIV get authentic dividend, deal with IDIV Σ → AX NXT retailer up to date quotient tmpA → DX RNI retailer the rest, run subsequent instruction

As mentioned above, the quotient in tmpC must be 1’s-complemented, which is finished with `COM1`

.

For `IDIV`

, the micro-subroutine `POSTIDIV`

units the indicators of the outcomes appropriately.

The outcomes are saved within the `AX`

and `DX`

registers.

The `NXT`

micro-operation signifies the subsequent micro-instruction is the final one, directing the microcode engine to

begin the subsequent

machine instruction. Lastly, `RNI`

directs the microcode engine to run the subsequent machine instruction.

## 8-bit division

The 8086 has separate opcodes for 8-bit division.

The 8086 helps many directions with byte and phrase variations, utilizing 8-bit or 16-bit arguments respectively.

Normally, the byte and phrase directions use the identical microcode, with the ALU and register {hardware} utilizing bytes or phrases based mostly on the instruction.

Within the case of division,

the shift micro-operations act on tmpA and tmpC as 8-bit registers moderately than 16-bit registers.

Furthermore, the `MAXC`

micro-operation initializes the inner counter to 7 moderately than 15.

Thus, the identical `CORD`

microcode is used for phrase and byte division, however the variety of loops and the precise

operations are modified by the {hardware}.

The diagram under reveals the tmpA and tmpC registers throughout every step of dividing 0x2345 by 0x34.

Observe that the registers are handled as 8-bit registers.

The divisor (blue) steadily shrinks with the quotient (inexperienced) taking its place.

On the finish, the rest is 0x41 (blue) and the quotient is 0xad, complement of the inexperienced worth.

tmpA | tmpC | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |

Though the `CORD`

routine is shared for byte and phrase division, the top-level microcode is totally different.

Particularly, the byte and phrase division directions use totally different registers, requiring microcode modifications.

The microcode under is the top-level code for byte division. It’s nearly the identical because the microcode above, besides it

makes use of the highest and backside bytes of the accumulator (`AH`

and `AL`

) moderately than the `AX`

and `DX`

registers.

AH → tmpAiDIV rmb:get arguments AL → tmpC RCL tmpA arrange RCL left shift operation M → tmpB CALL X0 PREIDIV deal with signed division if IDIV CALL CORD name CORD to carry out division COM1 tmpC complement the quotient AH → tmpB CALL X0 POSTIDIV deal with signed division if IDIV Σ → AL NXT retailer quotient tmpA → AH RNI retailer the rest, run subsequent instruction

## Signed division

The 8086 (like most computer systems) represents signed numbers utilizing a format known as two’s complement.

Whereas a daily byte holds a quantity from 0 to 255, a signed byte holds a quantity from -128 to 127.

A detrimental quantity is shaped by flipping all of the bits (often known as the one’s complement) after which including 1, yielding the 2’s complement worth.

As an example, +5 is `0x05`

whereas -5 is `0xfb`

.

(Observe that the highest little bit of a quantity is about for a detrimental quantity; that is the signal bit.)

The good factor about two’s complement numbers is that the identical addition and subtraction operations work on each signed and unsigned values.

Sadly, this isn’t the case for signed multiplication and division.

The 8086 has separate `IDIV`

(Integer Divide) directions to carry out signed division.

The 8086 performs signed division by changing the arguments to constructive values, performing unsigned division, after which

negating the outcomes if obligatory.

As proven earlier, signed and unsigned division each use the identical top-level microcode and the microcode conditionally calls some subroutines for

signed division.

These further subroutines trigger a major efficiency penalty, making signed division over 20 cycles slower than unsigned division.

I’ll talk about these micro-subroutines under.

`PREIDIV`

The primary subroutine for signed division is `PREIDIV`

, performing preliminary operations for integer division.

It converts the 2 arguments, saved in tmpA/tmpC and tmpB, to constructive values.

It retains monitor of the indicators utilizing an inside flag known as `F1`

, toggling this flag for every detrimental argument.

This conveniently handles the rule that two negatives make a constructive since complementing the `F1`

flag twice will clear it.

The purpose of that is that the primary division code (`CORD`

) solely must deal with unsigned division.

The microcode under implements `PREIDIV`

.

First it checks if tmpA is detrimental, however

the 8086 doesn’t have a microcode situation to straight check the signal of a price.

As an alternative, the microcode determines if a price is detrimental by shifting the worth left, which strikes the highest (signal) bit into the carry flag.

The conditional leap (`NCY`

) then checks if the carry is obvious, leaping if the worth is non-negative.

If tmpA is detrimental, execution falls by way of to negate the primary argument.

That is tough as a result of the argument is cut up between the tmpA and tmpC registers.

The 2’s complement operation (`NEG`

) is utilized to the low phrase, whereas both 2’s complement or one’s complement (`COM1`

) is utilized to

the higher phrase, relying on the carry for mathematical causes.5

The `F1`

flag is complemented to maintain monitor of the signal.

(The multiplication course of reuses most of this code, beginning on the `NEGATE`

entry level.)

Σ → no destPREIDIV:shift tmpA left JMPS NCY 7 leap if non-negative NEG tmpCNEGATE:negate tmpC Σ → tmpC COM1 tmpA F possibly complement tmpA JMPS CY 6 NEG tmpA negate tmpA if there is not any carry Σ → tmpA CF16:toggle F1 (signal) RCL tmpB7:check signal of tmpB Σ → no dest NEG tmpB possibly negate tmpB JMPS NCY 11 skip if tmpB constructive Σ → tmpB CF1 RTN else negate tmpB, toggle F1 (signal) RTN11:return

The following a part of the code, beginning at *7*, negates tmpB (the divisor) whether it is detrimental. Because the divisor is a single

phrase, this code is easier.

As earlier than, the `F1`

flag is toggled if tmpB is detrimental.

On the finish, each arguments (tmpA/tmpC and tmpB) are constructive, and `F1`

signifies the signal of the outcome.

`POSTIDIV`

After computing the outcome, the `POSTIDIV`

routine known as for signed division.

The routine first checks for a signed overflow and raises a divide-by-zero interrupt if that’s the case.

Subsequent, the routine negates the quotient and the rest if obligatory.6

In additional element, the `CORD`

routine left the highest little bit of tmpC (the complemented quotient) within the carry flag.

Now, that bit is examined. If the carry bit is 0 (`NCY`

), then the highest little bit of the quotient is 1 so the quotient is just too huge to slot in a signed worth.7

On this case, the `INT0`

routine is executed to set off a kind 0 interrupt, indicating a divide overflow.

(It is a moderately roundabout method of testing the quotient, counting on a carry bit that was set in a earlier subroutine.)

JMP NCY INT0POSTIDIV:if overflow, set off interrupt RCL tmpB arrange rotate of tmpB Σ → no dest NEG tmpA get signal of tmpB, arrange negate of tmpA JMPS NCY 5 skip if tmpB non-negative Σ → tmpA in any other case negate tmpA (the rest) INC tmpC5:arrange increment JMPS F1 8 check signal flag, skip if set COM1 tmpC in any other case arrange complement CCOF RTN8:clear carry and overflow flags, return

Subsequent, tmpB (the divisor) is rotated to see whether it is detrimental.

(The caller loaded tmpB with the unique divisor, changing the dividend that was in tmpB beforehand.)

If the divisor is detrimental, tmpA (the rest) is negated.

This implements the 8086 rule that the signal of the rest matches the signal of the divisor.

The quotient dealing with is a bit tough. Recall that tmpC holds the complemented quotient.

the `F1`

flag is about if the outcome must be detrimental. In that case, the complemented quotient must be incremented

by 1 (`INC`

) to transform from 1’s complement to 2’s complement.

However, if the quotient must be constructive, 1’s-complementing tmpC (`COM1`

) will yield the specified constructive

quotient.

In both case, the ALU is configured in `POSTIDIV`

, however the outcome might be saved again in the primary routine.

Lastly, the `CCOF`

micro-operation clears the carry and overflow flags.

Curiously, the 8086 documentation declares that the standing flags are undefined following `IDIV`

, however the microcode

explicitly clears the carry and overflow flags.

I assume that the flags had been cleared in analogy with `MUL`

, however then Intel determined that this wasn’t helpful so that they

did not doc it. (Documenting this characteristic would obligate them to supply the identical performance in later x86 chips.)

## The {hardware} for division

For probably the most half, the 8086 makes use of the common ALU addition and shifts for the division algorithm. Some particular {hardware}

options present help.

On this part, I am going to take a look at this {hardware}.

### Loop counter

The 8086 has a 4-bit loop counter for multiplication and division. This counter begins at 7 for byte division and 15 for phrase division,

based mostly on the low little bit of the opcode.

This loop counter permits the microcode to decrement the counter, check for the top, and carry out a conditional department in a single micro-operation.

The counter is carried out with 4 flip-flops, together with logic to compute the worth after decrementing by one.

The `MAXC`

(Most Depend) micro-instruction units the counter to 7 or 15 for byte or phrase operations respectively.

The `NCZ`

(Not Counter Zero) micro-instruction has two actions. First, it performs a conditional leap if the counter is nonzero.

Second, it decrements the counter.

### The F1 flag

Signed multiplication and division use an inside flag known as `F1`

8 to maintain monitor of the signal.

The `F1`

flag is toggled by microcode by way of the `CF1`

(Complement F1) micro-instruction.

The `F1`

flag is carried out with a flip-flop, together with a multiplexer to pick out the worth. It’s cleared when a brand new instruction begins,

set by a `REP`

prefix, and toggled by the `CF1`

micro-instruction.

The diagram under reveals how the F1 latch and the loop counter seem on the die. On this picture, the metallic layer has been eliminated, exhibiting the

silicon and the polysilicon wiring beneath.

The counter and F1 latch as they seem on the die. The latch for the REP state can also be right here.

### X register

The division microcode makes use of an inside register known as the `X`

register to differentiate between the `DIV`

and `IDIV`

directions.

The `X`

register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.9

Because the instruction is held within the Instruction Register, you may surprise why a separate register is required.

The motivation is that some opcodes specify the kind of ALU operation within the second byte of the instruction, the ModR/M byte, bits 5–3.10

Because the ALU operation is typically specified within the first byte and typically within the second byte, the `X`

register was added to deal with

each these circumstances.

For probably the most half, the `X`

register signifies which of the eight normal ALU operations is chosen (`ADD`

, `OR`

, `ADC`

, `SBB`

, `AND`

, `SUB`

, `XOR`

, `CMP`

).

Nevertheless, a number of directions use bit 0 of the `X`

register to differentiate between different pairs of directions.

As an example, it distinguishes between `MUL`

and `IMUL`

, `DIV`

and `IDIV`

, `CMPS`

and `SCAS`

, `MOVS`

and `LODS`

, or `AAA`

and `AAS`

.

Whereas these instruction pairs might seem to have arbitrary opcodes, they’ve been rigorously assigned

so the microcode can distinguish them.

The implementation of the `X`

register is simple, consisting of three flip-flops to carry the three bits of the instruction.

The flip-flops are loaded from the prefetch queue bus throughout First Clock and through Second Clock for acceptable directions, because the

instruction bytes journey over the bus.

Testing bit 0 of the `X`

register with the `X0`

situation is supported by the microcode situation analysis circuitry, so it may be used for conditional jumps within the microcode.

## Algorithmic and historic context

As you may see from the microcode, division is a sophisticated and comparatively gradual course of.

On the 8086, division takes as much as 184 clock cycles to carry out all of the microcode steps.

(Compared, two registers could be added in 3 clock cycles.)

Multiplication and division each loop over the bits, performing repeated addition or subtraction respectively.

However division requires a choice (subtract or not?) at every step, making it even slower, about half the pace of multiplication.11

Numerous algorithms have been developed to hurry up division.

Moderately than performing lengthy division one bit at a time, you are able to do lengthy division in, say, base 4, producing two quotient

bits in every step.

As with decimal lengthy division, the tough half is determining what digit to pick out. The SRT algorithm (1957) makes use of a small

lookup desk to estimate the quotient digit from a number of bits of the divisor and dividend.

The intelligent half is that the chosen digit would not should be precisely proper at every step; the algorithm will self-correct

after a flawed “guess”.

The Pentium processor (1993) famously had a floating point division bug

due to some lacking values within the SRT desk. This bug value Intel $475 million to switch the defective processors.

Intel’s x86 processors steadily improved divide efficiency. The 80286 (1982) carried out a phrase divide in 22 clocks, about

6 occasions as quick because the 8086.

Within the Penryn structure (2007), Intel upgraded from

Radix-4 to Radix-16 division.

Moderately than having separate integer and floating-point {hardware}, integer divides had been dealt with by way of the floating level divider.

Though trendy Intel processors have enormously improved multiplication and division in comparison with the 8086, division remains to be a comparatively gradual operation.

Whereas a Tiger Lake (2020) processor can carry out an integer multiplication each clock cycle (with a latency of three cycles),

division is way slower and may solely be completed as soon as each 6-10 clock cycles (details).

I’ve written quite a few posts on the 8086 up to now and

plan to proceed reverse-engineering the 8086 die so

observe me on Twitter @kenshirriff or RSS for updates.

I’ve additionally began experimenting with Mastodon just lately as @[email protected].