Now Reading
Reverse-engineering the multiplication algorithm within the Intel 8086 processor

Reverse-engineering the multiplication algorithm within the Intel 8086 processor

2023-03-15 11:16:44

Whereas programmers in the present day take multiplication without any consideration, most microprocessors within the Seventies may solely add and subtract — multiplication required a sluggish and tedious loop applied in meeting code.1
One of many good options of the Intel 8086 processor (1978) was
that it supplied machine directions for multiplication,2 capable of multiply 8-bit or 16-bit numbers with a single instruction.
Internally, the 8086 nonetheless carried out a loop, however the loop was applied in microcode: quicker and clear to
the programmer.
Even so, multiplication was a sluggish operation, about 24 to 30 instances slower than addition.

On this weblog publish, I clarify the multiplication course of contained in the 8086, analyze the microcode that it used, and talk about the {hardware}
circuitry that helped it out.3
My evaluation is predicated on reverse-engineering the 8086 from die photographs. The die picture under exhibits the chip underneath a microscope.
I’ve labeled the important thing purposeful blocks; those which can be vital to this publish are darker.
On the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations on the coronary heart of multiplication: addition and shifts.
Multiplication additionally makes use of a couple of different {hardware} options: the X register, the F1 flag, and a loop counter.
The microcode ROM on the decrease proper controls the method.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die underneath a microscope, with fundamental purposeful blocks labeled. This picture exhibits the chip with the steel and polysilicon eliminated, revealing the silicon beneath. Click on on this picture (or every other) for a bigger model.

Microcode

The multiplication routines within the 8086 are applied in microcode.
Most individuals consider machine directions as the fundamental steps that a pc performs.
Nevertheless, many processors (together with the 8086) have one other layer of software program beneath: microcode.
With microcode, as a substitute of constructing the management circuitry from advanced logic gates, the management logic is essentially changed with code.
To execute a machine instruction, the pc internally executes a number of easier micro-instructions, specified by the microcode.
That is particularly helpful for a machine instruction comparable to multiplication, which requires many steps in a loop.

A micro-instruction within the 8086 is encoded into 21 bits as proven under.
Each micro-instruction has a transfer from a supply register to a vacation spot register, every specified with 5 bits.
The which means of the remaining bits depends upon the kind discipline and may be something from an ALU operation to a reminiscence learn or write to
a change of microcode management movement.
Thus, an 8086 micro-instruction usually does two issues in parallel: the transfer and the motion.
For extra about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The conduct of an ALU micro-operation is vital for multiplication.
The ALU has three short-term registers which can be invisible to the programmer: tmpA, tmpB, and tmpC.
An ALU operation takes its first argument from any short-term register, whereas the second argument at all times comes from tmpB.
An ALU operation requires two micro-instructions.
The primary micro-instruction specifies the ALU operation and supply register, configuring the ALU. As an example, ADD tmpA so as to add tmpA to the default tmpB.
Within the subsequent micro-instruction (or a later one), the ALU end result may be accessed via the Σ register and moved to a different register.

Earlier than I get into the microcode routines, I ought to clarify two ALU operations that play a central position in multiplication: LRCY and RRCY, Left Rotate via Carry and Proper Rotate via Carry.
(These correspond to the RCL and RCR machine directions, which rotate via carry left or proper.)
These operations shift the bits in a 16-bit phrase, just like the << and >> bit-shift operations in high-level languages, however with an extra characteristic.
As an alternative of discarding the bit on the top, that bit is moved into the carry flag (CF). In the meantime, the bit previously within the carry flag strikes into the phrase.
You may consider this as rotating the bits whereas treating the carry flag as a seventeenth little bit of the phrase.

The left rotate through carry and right rotate through carry micro-instructions.

The left rotate via carry and proper rotate via carry micro-instructions.

These shifts carry out an vital a part of the multiplication course of since shifting may be seen as multiplying by two.
LRCY additionally supplies a handy option to transfer the most-significant bit to the carry flag, the place it may be examined for a conditional leap.
(That is vital as a result of the highest bit is used because the signal bit.)
Equally, RRCY supplies entry to the least important bit, crucial for the multiplication course of.
One other vital property is that performing RRCY on an higher phrase after which RRCY on a decrease phrase will carry out a 32-bit shift, since
the low little bit of the higher phrase can be moved into the excessive little bit of the decrease phrase through the carry bit.

Binary multiplication

The shift-and-add technique of multiplication (under) is just like grade-school lengthy multiplication, besides it makes use of binary as a substitute of decimal.
In every row, the multiplicand is multiplied by one digit of the multiplier. (The multiplicand is the worth that will get repeatedly added, and the multiplier
controls what number of instances it will get added.)
Successive rows are shifted left one digit.
On the backside, the rows are added collectively to yield the product.
The instance under exhibits how 6×5 is calculated in binary utilizing lengthy multiplication.

    0110

   ×0101

    0110

   0000

  0110

 0000

00011110

Binary lengthy multiplication is way easier than decimal multiplication: at every step, you are multiplying by 0 or 1.
Thus, every row is both zero or the multiplicand appropriately shifted (0110 on this case).
(Not like decimal lengthy multiplication, you need not know the multiplication desk.)
This simplifies the {hardware} implementation, since every step both provides the multiplicand or would not.
In different phrases, every step assessments a little bit of the multiplier, beginning with the low bit, to find out if an add ought to happen or not.
This bit may be obtained by shifting the multiplier one bit to the correct every step.

Though the diagram above exhibits the sum on the finish, an actual implementation performs the addition at every step of the loop, protecting a working complete.
Furthermore, within the 8086, as a substitute of shifting the multiplicand to the left throughout every step, the sum shifts to the correct.
(The end result is identical but it surely makes the implementation simpler.)
Thus, multiplying 6×5 goes via the steps under.

  0101

 ×0110

 00000

 001010

 0011110

 00011110

Why would you shift the end result to the correct?
There is a intelligent motive for this.
Suppose you are multiplying two 16-bit numbers, which yields a 32-bit end result. That requires 4 16-bit phrases of storage if you happen to use the
simple strategy.
However if you happen to look extra intently, the primary sum matches into 16 bits, and then you definitely want yet one more bit at every step. In the meantime, you are “utilizing up”
one little bit of the multiplier at every step.
So if you happen to squeeze the sum and the multiplier collectively, you possibly can match them into two phrases.
Shifting proper accomplishes this, because the diagram under illustrates for 0xffff×0xf00f. The sum (blue) begins in a 16-bit register known as tmpA
whereas the multiplier (inexperienced) is saved within the 16-bit tmpB register.
In every step, they’re each shifted proper, so the sum positive factors one bit and the multiplier loses one bit. By the top, the sum takes up all 32 bits,
break up throughout each registers.

sum (tmpA) multiplier (tmpC)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1
1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1
1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1
1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0
0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0
0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0
0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0
0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0
0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1
1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1
1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1
1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1

The multiplication microcode

The 8086 has 4 multiply directions to deal with signed and unsigned multiplication of byte and phrase operands.
These machine directions are applied in microcode.
I am going to begin by describing the unsigned phrase multiplication, which multiplies two 16-bit values and produces a 32-bit end result.
The supply phrase is supplied by both a register or reminiscence. It’s multiplied by AX, the accumulator register.
The 32-bit result’s returned within the DX and AX registers.

The microcode under is the primary routine for phrase multiplication, each signed and unsigned.
Every micro-instruction specifies a register transfer on the left, and an motion on the correct.
The strikes switch phrases between the seen registers and the ALU’s short-term registers, whereas the actions are largely subroutine calls
to different micro-routines.

  transfer        motion
AX → tmpC   LRCY tmpC        iMUL rmw:
M → tmpB    CALL X0 PREIMUL   known as for signed multiplication
            CALL CORX         the core routine
            CALL F1 NEGATE    known as for adverse end result
            CALL X0 IMULCOF   known as for signed multiplication
tmpC → AX   JMPS X0 7  
            CALL MULCOF       known as for unsigned multiplication
tmpA → DX   RNI  

The microcode begins by shifting one argument AX into the ALU’s short-term C register and organising the ALU to carry out a Left Rotate via Keep it up this register, in an effort to entry the signal bit.
Subsequent, it strikes the second argument M into the short-term B register; M references the register or reminiscence specified within the second byte
of the instruction, the “ModR/M” byte.
For a signed multiply instruction, the PREIMUL micro-subroutine known as, however I am going to skip that for now.
(The X0 situation assessments bit 3 of the instruction, which on this case distinguishes MUL from IMUL.)
Subsequent, the CORX subroutine known as, which is the guts of the multiplication.4
If the end result must be negated (indicated by the F1 situation), the NEGATE micro-subroutine known as.
For signed multiplication, IMULCOF is then known as to set the carry and overflow flags, whereas MULCOF known as for unsigned multiplication.
In the meantime, the end result bytes are moved from the short-term C and short-term registers to the AX and DX registers.
Lastly, RNI runs the following machine instruction, ending the microcode routine.

CORX

The guts of the multiplication code is the CORX routine, which performs the multiplication loop, computing the product via shifts and provides.
The primary two traces arrange the loop, initializing the sum (tmpA) to 0.
The variety of loops is managed by a special-purpose loop counter.
The MAXC micro-instruction initializes the counter to 7 or 15, for a byte or phrase multiply respectively.
The primary shift of tmpC is carried out, placing the low bit into the carry flag.

The loop physique performs the shift-and-add step.
It assessments the carry flag, the low little bit of the multiplicand. It skips over the ADD if there isn’t any carry (NCY).
In any other case, tmpB is added to tmpA. (As tmpA will get shifted to the correct, tmpB will get added to larger and better positions within the end result.)
The tmpA and tmpC registers are rotated proper. This additionally places the following little bit of the multiplicand into the carry flag for the following cycle.
The microcode jumps to the highest of the loop if the counter shouldn’t be zero (NCZ). In any other case, the subroutine returns with the lead to tmpA and tmpC.

ZERO → tmpA  RRCY tmpC   CORX: initialize proper rotate
Σ → tmpC     MAXC          get rotate end result, initialize counter to max worth
             JMPS NCY 8  5: prime of loop
             ADD tmpA     conditionally add
Σ → tmpA               F  sum to tmpA, replace flags to get carry
             RRCY tmpA   8: 32-bit shift of tmpA/tmpC
Σ → tmpA     RRCY tmpC  
Σ → tmpC     JMPS NCZ 5   loop to five if counter shouldn't be 0
             RTN  

MULCOF

The final subroutine is MULCOF, which configures the carry and overflow flags.
The 8086 makes use of the rule that if the higher half of the result’s nonzero, the carry and overflow flags are set, in any other case they’re cleared.
The primary two traces move tmpA (the higher half of the end result) via the ALU to set the
zero flag for the conditional leap. As a side-effect, the opposite standing flags will get set however these values are “undefined” within the documentation.6
If the check is nonzero, the carry and overflow flags are set (SCOF), in any other case they’re cleared (CCOF).5
The SCOF and CCOF micro-operations had been applied solely for utilized by multiplication, illustrating how microcode may be designed round
particular wants.

             PASS tmpA  MULCOF: move tmpA via to check if zero
Σ → no dest  JMPS 12   F replace flags

             JMPS Z 8   12: leap if zero
             SCOF RTN    in any other case set carry and overflow

             CCOF RTN   8: clear carry and overflow

8-bit multiplication

The 8086 has separate directions for 8-bit multiplication.
The method for 8-bit multiplication is just like 16-bit multiplication, besides the values are half as lengthy and the shift-and-add loop executes 8 instances as a substitute of 16.
As proven under, the 8-bit sum begins within the low half of the short-term A register and is shifted left into tmpC.
In the meantime, the 8-bit multiplier begins within the low half of tmpC and is shifted out to the correct.
On the finish, the result’s break up between tmpA and tmpC.

sum (tmpA) multiplier (tmpC)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1
0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0

The 8086 helps many directions with byte and phrase variations, utilizing 8-bit or 16-bit arguments.
Usually, the byte and phrase directions use the identical microcode, with the ALU and register {hardware} utilizing bytes or phrases based mostly on the instruction.
Nevertheless, the byte- and word-multiply directions use totally different registers, requiring microcode modifications.
Specifically, the multiplier is in AL, the low half of the accumulator.
On the finish, the 16-bit result’s returned in AX, the total 16-bit accumulator; two micro-instructions assemble the end result from tmpC and tmpA into
the 2 bytes of the accumulator, ‘AL’ and ‘AH’ respectively.
Aside from these modifications, the microcode is identical because the phrase multiply microcode mentioned earlier.

AL → tmpC    LRCY tmpC         iMUL rmb:
M → tmpB     CALL X0 PREIMUL  
             CALL CORX  
             CALL F1 NEGATE  
             CALL X0 IMULCOF  
tmpC → AL    JMPS X0 7  
             CALL MULCOF  
tmpA → AH    RNI

Signed multiplication

The 8086 (like most computer systems) represents signed numbers utilizing a format known as two’s complement.
Whereas an everyday byte holds a quantity from 0 to 255, a signed byte holds a quantity from -128 to 127.
A adverse quantity is shaped by flipping all of the bits (generally known as the one’s complement) after which including 1, yielding the 2’s complement worth.7
As an example, +5 is 0x05 whereas -5 is 0xfb.
(Observe that the highest little bit of a quantity is about for a adverse quantity; that is the signal bit.)
The great factor about two’s complement numbers is that the identical addition and subtraction operations work on each signed and unsigned values.
Sadly, this isn’t the case for signed multiplication, since signed and unsigned values yield totally different outcomes.8

The 8086 has separate multiplication directions IMUL (Integer Multiply) to carry out signed multiplication.
The 8086 performs signed multiplication by changing the arguments to constructive values, performing unsigned multiplication, after which
negating the end result if needed.
As proven above, signed and unsigned multiplication each use the identical microcode, however the microcode conditionally calls some subroutines for
signed multiplication.
I’ll talk about these micro-subroutines under.

PREIMUL

The primary subroutine for signed multiplication is PREIMUL, performing preliminary operations for integer multiplication.
It converts the 2 arguments, saved in tmpC and tmpB, to constructive values.
It retains monitor of the indicators utilizing an inside flag known as F1, toggling this flag for a adverse argument.
This conveniently handles the rule that two negatives make a constructive since complementing the F1 flag twice will clear it.

This microcode, under, illustrates the complexity of microcode and the way micro-operations are rigorously organized to get the correct values on the proper time.
The primary micro-instruction performs one ALU operation and units up a second operation.
The calling code had arrange the ALU to carry out LRCY tmpC, so that is the end result returned by Σ (and discarded).
Performing a left rotate and discarding the end result could seem pointless, however the vital side-effect is that the highest bit
(i.e. the signal bit) leads to the carry flag.
The microcode doesn’t have a conditional leap based mostly on the signal, however has a conditional leap based mostly on carry, so the purpose is
to check if tmpC is adverse.
The primary micro-instruction additionally units up negation (NEG tmpC) for the subsequent ALU operation.

Σ → no dest  NEG tmpC   PREIMUL: arrange negation of tmpC
             JMPS NCY 7  leap if tmpC constructive
Σ → tmpC     CF1         if adverse, negate tmpC, flip F1
             JMPS 7      leap to shared code

             LRCY tmpB  7:
Σ → no dest  NEG tmpB    arrange negation of tmpB
             JMPS NCY 11 leap if tmpB constructive
Σ → tmpB     CF1 RTN     if adverse, negate tmpB, flip F1
             RTN        11: return

For the remaining traces,
if the carry is obvious (NCY), the following two traces are skipped. In any other case, the ALU end result (Σ) is written to tmpC, making it constructive,
and the F1 flag is complemented with CF1. (The second quick leap (JMPS) might look pointless, however I reordered the code for readability.)
The second half of the microcode performs an analogous check on tmpB. If tmpB is adverse, it’s negated and F1 is toggled.

NEGATE

The microcode under known as after computing the end result, if the end result must be made adverse.
Negation is more durable than you would possibly anticipate as a result of the result’s break up between the tmpA and tmpC registers.
The 2’s complement operation (NEG) is utilized to the low phrase, whereas both 2’s complement or one’s complement (COM1) is utilized to
the higher phrase, relying on the carry for mathematical causes.9
The code additionally toggles F1 and makes tmpB constructive; I feel this code is barely helpful for division, which additionally makes use of the NEGATE subroutine.

             NEG tmpC   NEGATE: negate tmpC
Σ → tmpC     COM1 tmpA F perhaps complement tmpA
             JMPS CY 6  
             NEG tmpA    negate tmpA if there is no carry
Σ → tmpA     CF1        6: toggle F1 for some motive

             LRCY tmpB  7: check signal of tmpB
Σ → no dest  NEG tmpB    perhaps negate tmpB
             JMPS NCY 11 skip if tmpB constructive
Σ → tmpB     CF1 RTN     else negate tmpB, toggle F1
             RTN        11: return

IMULCOF

The IMULCOF routine is just like MULCOF, however the calculation is a bit trickier for a signed end result.
This routine units the carry and overflow flags if the higher half of the result’s important, that’s, it’s not
simply the signal extension of the decrease half.10
In different phrases, the highest byte shouldn’t be important if it duplicates the highest bit (the signal bit) of the decrease byte.
The trick within the microcode is so as to add the highest little bit of the decrease byte to the higher byte by placing it within the carry flag
and performing an add with carry (ADC) of 0.
If the result’s 0, the higher byte shouldn’t be important, dealing with the constructive and adverse instances.
(This additionally holds for phrases as a substitute of bytes.)

See Also

ZERO → tmpB  LRCY tmpC  IMULCOF: get prime little bit of tmpC
Σ → no dest  ADC tmpA    add to tmpA and 0 (tmpB)
Σ → no dest   F          replace flags
             JMPS Z 8   12: leap if zero end result
             SCOF RTN    in any other case set carry and overflow

             CCOF RTN   8: clear carry and overflow

The {hardware} for multiplication

For essentially the most half, the 8086 makes use of the common ALU addition and shifts for the multiplication algorithm. Some particular {hardware}
options present help.

Loop counter

The 8086 has a particular 4-bit loop counter for multiplication. This counter begins at 7 for byte multiplication and 15 for phrase multiplication,
based mostly on the instruction.
This loop counter permits the microcode to decrement the counter, check for the top, and carry out a conditional department in a single micro-operation.
The counter is applied with 4 flip-flops, together with logic to compute the worth after decrementing by one.
The MAXC (Most Depend) micro-instruction units the counter to 7 or 15 for byte or phrase operations respectively.
The NCZ (Not Counter Zero) micro-instruction has two actions. First, it performs a conditional leap if the counter is nonzero.
Second, it decrements the counter.

X register

The multiplication microcode makes use of an inside register known as the X register to tell apart between the MUL and IMUL directions.
The X register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.11
For the reason that instruction is held within the Instruction Register, you would possibly marvel why a separate register is required.
The motivation is that some opcodes specify the kind of ALU operation within the second byte of the instruction, the ModR/M byte, bits 5–3.12
For the reason that ALU operation is typically specified within the first byte and typically within the second byte, the X register was added to deal with
each these instances.

For essentially the most half, the X register signifies which of the eight customary ALU operations is chosen (ADD, OR, ADC, SBB, AND, SUB, XOR, CMP).
Nevertheless, a couple of directions use bit 0 of the X register to tell apart between different pairs of directions.
As an example, it distinguishes between MUL and IMUL, DIV and IDIV, CMPS and SCAS, MOVS and LODS, or AAA and AAS.
Whereas these instruction pairs might seem to have arbitrary opcodes, they’ve been rigorously assigned.
The microcode can check this bit utilizing the X0 situation and carry out conditional jumps.

The implementation of the X register is simple, consisting of three flip-flops to carry the three bits of the instruction.
The flip-flops are loaded from the prefetch queue bus throughout First Clock and through Second Clock for acceptable directions, because the
instruction bytes journey over the bus.
Testing bit 0 of the X register with the X0 situation is supported by the microcode situation analysis circuitry, so it may be used for conditional jumps within the microcode.

The F1 flag

The multiplication microcode makes use of an inside flag known as F1,13 which has two distinct makes use of.
The flag retains monitor of a REP prefix to be used with a string operation.
However the F1 flag can also be utilized by signed multiplication and division to maintain monitor of the signal.
The F1 flag may be toggled by microcode via the CF1 (Complement F1) micro-instruction.
The F1 flag is applied with a flip-flop, together with a multiplexer to pick out the worth. It’s cleared when a brand new instruction begins,
set by a REP prefix, and toggled by the CF1 micro-instruction.

The diagram under exhibits how the F1 latch and the loop counter seem on the die. On this picture, the steel layer has been eliminated, displaying the
silicon and the polysilicon wiring beneath.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

The counter and F1 latch as they seem on the die. The latch for the REP state can also be right here.

Later advances in multiplication

The 8086 was fairly sluggish at multiplying in comparison with later Intel processors.14
The 8086 took as much as 133 clock cycles to multiply unsigned 16-bit values because of the difficult microcode loops.
By 1982, the Intel 286 processor lower this time right down to 21 clock cycles.
The Intel 486 (1989) used an improved algorithm that might finish early, so multiplying by a small quantity may take simply 9 cycles.

Though these optimizations improved efficiency, they nonetheless relied on looping over the bits.
With the shift to 32-bit processors, the loop time turned unwieldy.
The answer was to interchange the loop with {hardware}: as a substitute of performing 32 shift-and-add loops,
an array of adders may compute the multiplication in a single step.
This amount of {hardware} was unreasonable within the 8086 period, however as Moore’s legislation made transistors smaller and cheaper, {hardware} multiplication turned
sensible.
As an example, the Cyrix Cx486SLC (1992) had a 16-bit {hardware} multiplier that lower phrase multiply down to three cycles.
The Intel Core 2 (2006) was even quicker, capable of full a 32-bit multiplication each clock cycle.



{Hardware} multiplication is a reasonably difficult topic, with many optimizations to maximise efficiency whereas minimizing {hardware}.15
Merely changing the loop with a sequence of 32 adders is simply too sluggish as a result of the end result can be delayed whereas propagating via all of the adders.
The answer is to rearrange the adders as a tree to supply parallelism. The primary layer has 16 adders so as to add pairs of phrases. The following layer provides pairs of those partial
sums, and so forth. The ensuing tree of adders is 5 layers deep slightly than 32, decreasing the time to compute the sum.
Actual multipliers obtain additional efficiency enhancements by splitting up the adders and making a extra advanced tree:
the venerable Wallace tree (1964) and Dadda multiplier (1965) are
two fashionable approaches.
One other optimization is the Booth algorithm (1951), which performs signed
multiplication instantly, with out changing the arguments to constructive values first.
The Pentium 4 (2000) used a Sales space encoder and a Wallace tree
(ref), however analysis in
the early 2000s discovered the Dadda tree is quicker and it’s now extra fashionable.

Conclusions

Multiplication is way more durable to compute than addition or subtraction. The 8086 processor hid this complexity from the programmer
by offering 4 multiplication directions for byte and phrase multiplication of signed or unsigned values.
These directions had been applied multiplication in microcode, performing shifts and provides in a loop.
By utilizing microcode subroutines and conditional execution, these 4 machine directions share a lot of the microcode.
Because the microcode capability of the 8086 was very small, this was a crucial characteristic of the implementation.

In case you made it via all of the dialogue of microcode, congratulations!
Microcode is even more durable to know than meeting code.
A part of the issue is that microcode may be very fine-grain, with even ALU operations break up into a number of steps.
One other complication is that 8086 microcode performs a register transfer and one other operation in parallel, so it is exhausting to
maintain monitor of what is going on on.
Microcode can appear a bit like a jigsaw puzzle, with items rigorously match collectively as compactly as doable.
I hope the reasons right here made sense, or at the very least gave you a really feel for a way microcode operates.

I’ve written a number of posts on the 8086 to date and
plan to proceed reverse-engineering the 8086 die so
observe me on Twitter @kenshirriff or RSS for updates.
I’ve additionally began experimenting with Mastodon lately as @[email protected].

Notes and references



Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top