# Reverse-engineering the multiplication algorithm within the Intel 8086 processor

*by*Phil Tadros

Whereas programmers in the present day take multiplication without any consideration, most microprocessors within the Seventies may solely add and subtract — multiplication required a sluggish and tedious loop applied in meeting code.1

One of many good options of the Intel 8086 processor (1978) was

that it supplied machine directions for multiplication,2 capable of multiply 8-bit or 16-bit numbers with a single instruction.

Internally, the 8086 nonetheless carried out a loop, however the loop was applied in microcode: quicker and clear to

the programmer.

Even so, multiplication was a sluggish operation, about 24 to 30 instances slower than addition.

On this weblog publish, I clarify the multiplication course of contained in the 8086, analyze the microcode that it used, and talk about the {hardware}

circuitry that helped it out.3

My evaluation is predicated on reverse-engineering the 8086 from die photographs. The die picture under exhibits the chip underneath a microscope.

I’ve labeled the important thing purposeful blocks; those which can be vital to this publish are darker.

On the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations on the coronary heart of multiplication: addition and shifts.

Multiplication additionally makes use of a couple of different {hardware} options: the X register, the F1 flag, and a loop counter.

The microcode ROM on the decrease proper controls the method.

The 8086 die underneath a microscope, with fundamental purposeful blocks labeled. This picture exhibits the chip with the steel and polysilicon eliminated, revealing the silicon beneath. Click on on this picture (or every other) for a bigger model.

## Microcode

The multiplication routines within the 8086 are applied in microcode.

Most individuals consider machine directions as the fundamental steps that a pc performs.

Nevertheless, many processors (together with the 8086) have one other layer of software program beneath: microcode.

With microcode, as a substitute of constructing the management circuitry from advanced logic gates, the management logic is essentially changed with code.

To execute a machine instruction, the pc internally executes a number of easier micro-instructions, specified by the microcode.

That is particularly helpful for a machine instruction comparable to multiplication, which requires many steps in a loop.

A micro-instruction within the 8086 is encoded into 21 bits as proven under.

Each micro-instruction has a transfer from a supply register to a vacation spot register, every specified with 5 bits.

The which means of the remaining bits depends upon the kind discipline and may be something from an ALU operation to a reminiscence learn or write to

a change of microcode management movement.

Thus, an 8086 micro-instruction usually does two issues in parallel: the transfer and the motion.

For extra about 8086 microcode, see my microcode blog post.

The conduct of an ALU micro-operation is vital for multiplication.

The ALU has three short-term registers which can be invisible to the programmer: tmpA, tmpB, and tmpC.

An ALU operation takes its first argument from any short-term register, whereas the second argument at all times comes from tmpB.

An ALU operation requires two micro-instructions.

The primary micro-instruction specifies the ALU operation and supply register, configuring the ALU. As an example, `ADD tmpA`

so as to add tmpA to the default tmpB.

Within the subsequent micro-instruction (or a later one), the ALU end result may be accessed via the `Σ`

register and moved to a different register.

Earlier than I get into the microcode routines, I ought to clarify two ALU operations that play a central position in multiplication: `LRCY`

and `RRCY`

, Left Rotate via Carry and Proper Rotate via Carry.

(These correspond to the `RCL`

and `RCR`

machine directions, which rotate via carry left or proper.)

These operations shift the bits in a 16-bit phrase, just like the `<<`

and `>>`

bit-shift operations in high-level languages, however with an extra characteristic.

As an alternative of discarding the bit on the top, that bit is moved into the carry flag (`CF`

). In the meantime, the bit previously within the carry flag strikes into the phrase.

You may consider this as rotating the bits whereas treating the carry flag as a seventeenth little bit of the phrase.

The left rotate via carry and proper rotate via carry micro-instructions.

These shifts carry out an vital a part of the multiplication course of since shifting may be seen as multiplying by two.

`LRCY`

additionally supplies a handy option to transfer the most-significant bit to the carry flag, the place it may be examined for a conditional leap.

(That is vital as a result of the highest bit is used because the signal bit.)

Equally, `RRCY`

supplies entry to the least important bit, crucial for the multiplication course of.

One other vital property is that performing `RRCY`

on an higher phrase after which `RRCY`

on a decrease phrase will carry out a 32-bit shift, since

the low little bit of the higher phrase can be moved into the excessive little bit of the decrease phrase through the carry bit.

## Binary multiplication

The shift-and-add technique of multiplication (under) is just like grade-school lengthy multiplication, besides it makes use of binary as a substitute of decimal.

In every row, the multiplicand is multiplied by one digit of the multiplier. (The multiplicand is the worth that will get repeatedly added, and the multiplier

controls what number of instances it will get added.)

Successive rows are shifted left one digit.

On the backside, the rows are added collectively to yield the product.

The instance under exhibits how 6×5 is calculated in binary utilizing lengthy multiplication.

0110

×0101

0110

0000

0110

0000

00011110

Binary lengthy multiplication is way easier than decimal multiplication: at every step, you are multiplying by 0 or 1.

Thus, every row is both zero or the multiplicand appropriately shifted (0110 on this case).

(Not like decimal lengthy multiplication, you need not know the multiplication desk.)

This simplifies the {hardware} implementation, since every step both provides the multiplicand or would not.

In different phrases, every step assessments a little bit of the multiplier, beginning with the low bit, to find out if an add ought to happen or not.

This bit may be obtained by shifting the multiplier one bit to the correct every step.

Though the diagram above exhibits the sum on the finish, an actual implementation performs the addition at every step of the loop, protecting a working complete.

Furthermore, within the 8086, as a substitute of shifting the multiplicand to the left throughout every step, the sum shifts to the correct.

(The end result is identical but it surely makes the implementation simpler.)

Thus, multiplying 6×5 goes via the steps under.

0101

×0110

00000

001010

0011110

00011110

Why would you shift the end result to the correct?

There is a intelligent motive for this.

Suppose you are multiplying two 16-bit numbers, which yields a 32-bit end result. That requires 4 16-bit phrases of storage if you happen to use the

simple strategy.

However if you happen to look extra intently, the primary sum matches into 16 bits, and then you definitely want yet one more bit at every step. In the meantime, you are “utilizing up”

one little bit of the multiplier at every step.

So if you happen to squeeze the sum and the multiplier collectively, you possibly can match them into two phrases.

Shifting proper accomplishes this, because the diagram under illustrates for `0xffff`

×`0xf00f`

. The sum (blue) begins in a 16-bit register known as `tmpA`

whereas the multiplier (inexperienced) is saved within the 16-bit `tmpB`

register.

In every step, they’re each shifted proper, so the sum positive factors one bit and the multiplier loses one bit. By the top, the sum takes up all 32 bits,

break up throughout each registers.

sum (tmpA) | multiplier (tmpC) | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |

0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |

1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |

1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |

1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |

1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |

## The multiplication microcode

The 8086 has 4 multiply directions to deal with signed and unsigned multiplication of byte and phrase operands.

These machine directions are applied in microcode.

I am going to begin by describing the unsigned phrase multiplication, which multiplies two 16-bit values and produces a 32-bit end result.

The supply phrase is supplied by both a register or reminiscence. It’s multiplied by `AX`

, the accumulator register.

The 32-bit result’s returned within the `DX`

and `AX`

registers.

The microcode under is the primary routine for phrase multiplication, each signed and unsigned.

Every micro-instruction specifies a register transfer on the left, and an motion on the correct.

The strikes switch phrases between the seen registers and the ALU’s short-term registers, whereas the actions are largely subroutine calls

to different micro-routines.

transfer motion AX → tmpC LRCY tmpCiMUL rmw:M → tmpB CALL X0 PREIMUL known as for signed multiplication CALL CORX the core routine CALL F1 NEGATE known as for adverse end result CALL X0 IMULCOF known as for signed multiplication tmpC → AX JMPS X0 7 CALL MULCOF known as for unsigned multiplication tmpA → DX RNI

The microcode begins by shifting one argument `AX`

into the ALU’s short-term C register and organising the ALU to carry out a Left Rotate via Keep it up this register, in an effort to entry the signal bit.

Subsequent, it strikes the second argument `M`

into the short-term B register; `M`

references the register or reminiscence specified within the second byte

of the instruction, the “ModR/M” byte.

For a signed multiply instruction, the `PREIMUL`

micro-subroutine known as, however I am going to skip that for now.

(The `X0`

situation assessments bit 3 of the instruction, which on this case distinguishes `MUL`

from `IMUL`

.)

Subsequent, the `CORX`

subroutine known as, which is the guts of the multiplication.4

If the end result must be negated (indicated by the `F1`

situation), the `NEGATE`

micro-subroutine known as.

For signed multiplication, `IMULCOF`

is then known as to set the carry and overflow flags, whereas `MULCOF`

known as for unsigned multiplication.

In the meantime, the end result bytes are moved from the short-term C and short-term registers to the `AX`

and `DX`

registers.

Lastly, `RNI`

runs the following machine instruction, ending the microcode routine.

`CORX`

The guts of the multiplication code is the `CORX`

routine, which performs the multiplication loop, computing the product via shifts and provides.

The primary two traces arrange the loop, initializing the sum (tmpA) to 0.

The variety of loops is managed by a special-purpose loop counter.

The `MAXC`

micro-instruction initializes the counter to 7 or 15, for a byte or phrase multiply respectively.

The primary shift of tmpC is carried out, placing the low bit into the carry flag.

The loop physique performs the shift-and-add step.

It assessments the carry flag, the low little bit of the multiplicand. It skips over the `ADD`

if there isn’t any carry (`NCY`

).

In any other case, tmpB is added to tmpA. (As tmpA will get shifted to the correct, tmpB will get added to larger and better positions within the end result.)

The tmpA and tmpC registers are rotated proper. This additionally places the following little bit of the multiplicand into the carry flag for the following cycle.

The microcode jumps to the highest of the loop if the counter shouldn’t be zero (`NCZ`

). In any other case, the subroutine returns with the lead to tmpA and tmpC.

ZERO → tmpA RRCY tmpCCORX:initialize proper rotate Σ → tmpC MAXC get rotate end result, initialize counter to max worth JMPS NCY 85:prime of loop ADD tmpA conditionally add Σ → tmpA F sum to tmpA, replace flags to get carry RRCY tmpA8:32-bit shift of tmpA/tmpC Σ → tmpA RRCY tmpC Σ → tmpC JMPS NCZ 5 loop to five if counter shouldn't be 0 RTN

`MULCOF`

The final subroutine is `MULCOF`

, which configures the carry and overflow flags.

The 8086 makes use of the rule that if the higher half of the result’s nonzero, the carry and overflow flags are set, in any other case they’re cleared.

The primary two traces move tmpA (the higher half of the end result) via the ALU to set the

zero flag for the conditional leap. As a side-effect, the opposite standing flags will get set however these values are “undefined” within the documentation.6

If the check is nonzero, the carry and overflow flags are set (`SCOF`

), in any other case they’re cleared (`CCOF`

).5

The `SCOF`

and `CCOF`

micro-operations had been applied solely for utilized by multiplication, illustrating how microcode may be designed round

particular wants.

PASS tmpAMULCOF:move tmpA via to check if zero Σ → no dest JMPS 12 F replace flags JMPS Z 812:leap if zero SCOF RTN in any other case set carry and overflow CCOF RTN8:clear carry and overflow

## 8-bit multiplication

The 8086 has separate directions for 8-bit multiplication.

The method for 8-bit multiplication is just like 16-bit multiplication, besides the values are half as lengthy and the shift-and-add loop executes 8 instances as a substitute of 16.

As proven under, the 8-bit sum begins within the low half of the short-term A register and is shifted left into tmpC.

In the meantime, the 8-bit multiplier begins within the low half of tmpC and is shifted out to the correct.

On the finish, the result’s break up between tmpA and tmpC.

sum (tmpA) | multiplier (tmpC) | ||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

The 8086 helps many directions with byte and phrase variations, utilizing 8-bit or 16-bit arguments.

Usually, the byte and phrase directions use the identical microcode, with the ALU and register {hardware} utilizing bytes or phrases based mostly on the instruction.

Nevertheless, the byte- and word-multiply directions use totally different registers, requiring microcode modifications.

Specifically, the multiplier is in `AL`

, the low half of the accumulator.

On the finish, the 16-bit result’s returned in `AX`

, the total 16-bit accumulator; two micro-instructions assemble the end result from tmpC and tmpA into

the 2 bytes of the accumulator, ‘AL’ and ‘AH’ respectively.

Aside from these modifications, the microcode is identical because the phrase multiply microcode mentioned earlier.

`AL → tmpC LRCY tmpC `**iMUL rmb:**
M → tmpB CALL X0 PREIMUL
CALL CORX
CALL F1 NEGATE
CALL X0 IMULCOF
tmpC → AL JMPS X0 7
CALL MULCOF
tmpA → AH RNI

## Signed multiplication

The 8086 (like most computer systems) represents signed numbers utilizing a format known as two’s complement.

Whereas an everyday byte holds a quantity from 0 to 255, a signed byte holds a quantity from -128 to 127.

A adverse quantity is shaped by flipping all of the bits (generally known as the one’s complement) after which including 1, yielding the 2’s complement worth.7

As an example, +5 is `0x05`

whereas -5 is `0xfb`

.

(Observe that the highest little bit of a quantity is about for a adverse quantity; that is the signal bit.)

The great factor about two’s complement numbers is that the identical addition and subtraction operations work on each signed and unsigned values.

Sadly, this isn’t the case for signed multiplication, since signed and unsigned values yield totally different outcomes.8

The 8086 has separate multiplication directions `IMUL`

(Integer Multiply) to carry out signed multiplication.

The 8086 performs signed multiplication by changing the arguments to constructive values, performing unsigned multiplication, after which

negating the end result if needed.

As proven above, signed and unsigned multiplication each use the identical microcode, however the microcode conditionally calls some subroutines for

signed multiplication.

I’ll talk about these micro-subroutines under.

`PREIMUL`

The primary subroutine for signed multiplication is `PREIMUL`

, performing preliminary operations for integer multiplication.

It converts the 2 arguments, saved in tmpC and tmpB, to constructive values.

It retains monitor of the indicators utilizing an inside flag known as `F1`

, toggling this flag for a adverse argument.

This conveniently handles the rule that two negatives make a constructive since complementing the `F1`

flag twice will clear it.

This microcode, under, illustrates the complexity of microcode and the way micro-operations are rigorously organized to get the correct values on the proper time.

The primary micro-instruction performs one ALU operation and units up a second operation.

The calling code had arrange the ALU to carry out `LRCY tmpC`

, so that is the end result returned by Σ (and discarded).

Performing a left rotate and discarding the end result could seem pointless, however the vital side-effect is that the highest bit

(i.e. the signal bit) leads to the carry flag.

The microcode doesn’t have a conditional leap based mostly on the signal, however has a conditional leap based mostly on carry, so the purpose is

to check if tmpC is adverse.

The primary micro-instruction additionally units up negation (`NEG tmpC`

) for the *subsequent* ALU operation.

Σ → no dest NEG tmpCPREIMUL:arrange negation of tmpC JMPS NCY 7 leap if tmpC constructive Σ → tmpC CF1 if adverse, negate tmpC, flip F1 JMPS 7 leap to shared code LRCY tmpB7:Σ → no dest NEG tmpB arrange negation of tmpB JMPS NCY 11 leap if tmpB constructive Σ → tmpB CF1 RTN if adverse, negate tmpB, flip F1 RTN11:return

For the remaining traces,

if the carry is obvious (`NCY`

), the following two traces are skipped. In any other case, the ALU end result (`Σ`

) is written to tmpC, making it constructive,

and the `F1`

flag is complemented with `CF1`

. (The second quick leap (`JMPS`

) might look pointless, however I reordered the code for readability.)

The second half of the microcode performs an analogous check on tmpB. If tmpB is adverse, it’s negated and `F1`

is toggled.

`NEGATE`

The microcode under known as after computing the end result, if the end result must be made adverse.

Negation is more durable than you would possibly anticipate as a result of the result’s break up between the tmpA and tmpC registers.

The 2’s complement operation (`NEG`

) is utilized to the low phrase, whereas both 2’s complement or one’s complement (`COM1`

) is utilized to

the higher phrase, relying on the carry for mathematical causes.9

The code additionally toggles F1 and makes tmpB constructive; I feel this code is barely helpful for division, which additionally makes use of the `NEGATE`

subroutine.

NEG tmpCNEGATE:negate tmpC Σ → tmpC COM1 tmpA F perhaps complement tmpA JMPS CY 6 NEG tmpA negate tmpA if there is no carry Σ → tmpA CF16:toggle F1 for some motive LRCY tmpB7:check signal of tmpB Σ → no dest NEG tmpB perhaps negate tmpB JMPS NCY 11 skip if tmpB constructive Σ → tmpB CF1 RTN else negate tmpB, toggle F1 RTN11:return

`IMULCOF`

The `IMULCOF`

routine is just like `MULCOF`

, however the calculation is a bit trickier for a signed end result.

This routine units the carry and overflow flags if the higher half of the result’s important, that’s, it’s not

simply the signal extension of the decrease half.10

In different phrases, the highest byte shouldn’t be important if it duplicates the highest bit (the signal bit) of the decrease byte.

The trick within the microcode is so as to add the highest little bit of the decrease byte to the higher byte by placing it within the carry flag

and performing an add with carry (`ADC`

) of 0.

If the result’s 0, the higher byte shouldn’t be important, dealing with the constructive and adverse instances.

(This additionally holds for phrases as a substitute of bytes.)

ZERO → tmpB LRCY tmpCIMULCOF:get prime little bit of tmpC Σ → no dest ADC tmpA add to tmpA and 0 (tmpB) Σ → no dest F replace flags JMPS Z 812: leap if zero end resultSCOF RTN in any other case set carry and overflow CCOF RTN8:clear carry and overflow

## The {hardware} for multiplication

For essentially the most half, the 8086 makes use of the common ALU addition and shifts for the multiplication algorithm. Some particular {hardware}

options present help.

### Loop counter

The 8086 has a particular 4-bit loop counter for multiplication. This counter begins at 7 for byte multiplication and 15 for phrase multiplication,

based mostly on the instruction.

This loop counter permits the microcode to decrement the counter, check for the top, and carry out a conditional department in a single micro-operation.

The counter is applied with 4 flip-flops, together with logic to compute the worth after decrementing by one.

The `MAXC`

(Most Depend) micro-instruction units the counter to 7 or 15 for byte or phrase operations respectively.

The `NCZ`

(Not Counter Zero) micro-instruction has two actions. First, it performs a conditional leap if the counter is nonzero.

Second, it decrements the counter.

### X register

The multiplication microcode makes use of an inside register known as the `X`

register to tell apart between the `MUL`

and `IMUL`

directions.

The `X`

register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.11

For the reason that instruction is held within the Instruction Register, you would possibly marvel why a separate register is required.

The motivation is that some opcodes specify the kind of ALU operation within the second byte of the instruction, the ModR/M byte, bits 5–3.12

For the reason that ALU operation is typically specified within the first byte and typically within the second byte, the `X`

register was added to deal with

each these instances.

For essentially the most half, the `X`

register signifies which of the eight customary ALU operations is chosen (`ADD`

, `OR`

, `ADC`

, `SBB`

, `AND`

, `SUB`

, `XOR`

, `CMP`

).

Nevertheless, a couple of directions use bit 0 of the `X`

register to tell apart between different pairs of directions.

As an example, it distinguishes between `MUL`

and `IMUL`

, `DIV`

and `IDIV`

, `CMPS`

and `SCAS`

, `MOVS`

and `LODS`

, or `AAA`

and `AAS`

.

Whereas these instruction pairs might seem to have arbitrary opcodes, they’ve been rigorously assigned.

The microcode can check this bit utilizing the `X0`

situation and carry out conditional jumps.

The implementation of the `X`

register is simple, consisting of three flip-flops to carry the three bits of the instruction.

The flip-flops are loaded from the prefetch queue bus throughout First Clock and through Second Clock for acceptable directions, because the

instruction bytes journey over the bus.

Testing bit 0 of the `X`

register with the `X0`

situation is supported by the microcode situation analysis circuitry, so it may be used for conditional jumps within the microcode.

### The F1 flag

The multiplication microcode makes use of an inside flag known as `F1`

,13 which has two distinct makes use of.

The flag retains monitor of a `REP`

prefix to be used with a string operation.

However the `F1`

flag can also be utilized by signed multiplication and division to maintain monitor of the signal.

The `F1`

flag may be toggled by microcode via the `CF1`

(Complement F1) micro-instruction.

The `F1`

flag is applied with a flip-flop, together with a multiplexer to pick out the worth. It’s cleared when a brand new instruction begins,

set by a `REP`

prefix, and toggled by the `CF1`

micro-instruction.

The diagram under exhibits how the F1 latch and the loop counter seem on the die. On this picture, the steel layer has been eliminated, displaying the

silicon and the polysilicon wiring beneath.

The counter and F1 latch as they seem on the die. The latch for the REP state can also be right here.

## Later advances in multiplication

The 8086 was fairly sluggish at multiplying in comparison with later Intel processors.14

The 8086 took as much as 133 clock cycles to multiply unsigned 16-bit values because of the difficult microcode loops.

By 1982, the Intel 286 processor lower this time right down to 21 clock cycles.

The Intel 486 (1989) used an improved algorithm that might finish early, so multiplying by a small quantity may take simply 9 cycles.

Though these optimizations improved efficiency, they nonetheless relied on looping over the bits.

With the shift to 32-bit processors, the loop time turned unwieldy.

The answer was to interchange the loop with {hardware}: as a substitute of performing 32 shift-and-add loops,

an array of adders may compute the multiplication in a single step.

This amount of {hardware} was unreasonable within the 8086 period, however as Moore’s legislation made transistors smaller and cheaper, {hardware} multiplication turned

sensible.

As an example, the Cyrix Cx486SLC (1992) had a 16-bit {hardware} multiplier that lower phrase multiply down to three cycles.

The Intel Core 2 (2006) was even quicker, capable of full a 32-bit multiplication each clock cycle.

{Hardware} multiplication is a reasonably difficult topic, with many optimizations to maximise efficiency whereas minimizing {hardware}.15

Merely changing the loop with a sequence of 32 adders is simply too sluggish as a result of the end result can be delayed whereas propagating via all of the adders.

The answer is to rearrange the adders as a tree to supply parallelism. The primary layer has 16 adders so as to add pairs of phrases. The following layer provides pairs of those partial

sums, and so forth. The ensuing tree of adders is 5 layers deep slightly than 32, decreasing the time to compute the sum.

Actual multipliers obtain additional efficiency enhancements by splitting up the adders and making a extra advanced tree:

the venerable Wallace tree (1964) and Dadda multiplier (1965) are

two fashionable approaches.

One other optimization is the Booth algorithm (1951), which performs signed

multiplication instantly, with out changing the arguments to constructive values first.

The Pentium 4 (2000) used a Sales space encoder and a Wallace tree

(ref), however analysis in

the early 2000s discovered the Dadda tree is quicker and it’s now extra fashionable.

## Conclusions

Multiplication is way more durable to compute than addition or subtraction. The 8086 processor hid this complexity from the programmer

by offering 4 multiplication directions for byte and phrase multiplication of signed or unsigned values.

These directions had been applied multiplication in microcode, performing shifts and provides in a loop.

By utilizing microcode subroutines and conditional execution, these 4 machine directions share a lot of the microcode.

Because the microcode capability of the 8086 was very small, this was a crucial characteristic of the implementation.

In case you made it via all of the dialogue of microcode, congratulations!

Microcode is even more durable to know than meeting code.

A part of the issue is that microcode may be very fine-grain, with even ALU operations break up into a number of steps.

One other complication is that 8086 microcode performs a register transfer and one other operation in parallel, so it is exhausting to

maintain monitor of what is going on on.

Microcode can appear a bit like a jigsaw puzzle, with items rigorously match collectively as compactly as doable.

I hope the reasons right here made sense, or at the very least gave you a really feel for a way microcode operates.

I’ve written a number of posts on the 8086 to date and

plan to proceed reverse-engineering the 8086 die so

observe me on Twitter @kenshirriff or RSS for updates.

I’ve additionally began experimenting with Mastodon lately as @[email protected].