# Speed up a program for the 50-year-old processor by 180000%

*by* Phil Tadros

Last year, I wrote a program, running on Intel's first microprocessor – the 4004 – that computes the first 255 digits of π. But, sadly, I was not able to beat ENIAC's achievement of 2035 computed digits. So let's continue our journey. I picked the successor of the 4004 – Intel's 4040. This CPU is very similar to the 4004 and extends the instruction set a bit. To make my challenge a bit more spicy, I decided to follow these rules:

- We should compute at least 2035 digits of π faster than ENIAC did. According to the original article, it took 70 hours of machine running time. So our goal is 69 hours, 59 minutes, 59 seconds or less.
- We can't use extra I/O ports to work around CPU restrictions. For instance, it means that I can't have any external RAM/ROM bank switchers controlled by I/O ports.
- It has to be pure computation, with no pre-computed data tables in RAM, not even a simple multiplication table.
- Our code should support arbitrary values for the number of computed digits, as long as this value is less than or equal to a high boundary. This restriction rules out any fine-tuning for particular results. The solution has to be generic.

*Please keep in mind that this article is structured by chapters, while the actual optimizations and changes were done in a different order. For example, I optimized modular multiplication, then switched to general algorithm improvements and, because that was not enough, returned to work on multiplication. Nevertheless, in the article, all optimization steps for modular multiplication will be explained in the same section.*

Even though I chose a fresher processor, it still inherits the majority of its predecessor's limitations:

- The CPU clock rate is 740kHz, but an instruction cycle takes 8 or 16 ticks, so we have up to 92500 instructions per second.
- Very limited instruction set. We don't have multiplication, division, or easy memory operations. Nuff said, we don't even have shift instructions – only rotate operations.
- RAM is 1280 bytes only. Even though the datasheet mentions lower numbers, there were real systems with 1280 bytes on board without any complex memory switch logic.

But the good news is that we got some new features with the 4040:

- Now we have AND/OR instructions. You read it right, the 4004 didn't have them. Unfortunately, these instructions are not perfect – they perform a binary operation with the accumulator and a specific register (`rr6` or `rr7` for the `AND` operation, for instance).
- The call stack grew from 3 nested subroutine calls to 7, so we can organize our program better. Still, it's a pretty low value and at some point, I exceeded it.
- The amount of general-purpose registers increased from 16 to 24. Intel added a second bank with 8 registers. Because the ISA is basically the same and the register index is still encoded as a 4-bit number, to access the extra registers we need to switch between index register banks by executing special instructions: `SB0`/`SB1`, which adds some inconvenience.
- ROM space has been extended as well. Now we have 2 banks of ROM with 4KiB each. Again, switching between banks is not transparent, you need to call the `DB0`/`DB1` instruction at a specific moment.

There are other changes brought by the 4040, but they are not that valuable for my purpose: interrupts, single-step debugging, and separate voltage sources to reduce power consumption.

## Algorithm choice

Let’s re-evaluate what algorithms for π calculation exist.

- **Spigot algorithm**. I've already described that algorithm in my previous article. It requires keeping an array of digits in memory. The length of the array is proportional to the number of computed digits. To have 2000+ digits, we need much more RAM.
- **Machin-like formulae**. A good choice, but again memory limits us. We need to perform long arithmetic and the number of digits of the operands is comparable with the desired number of π digits. Maybe it would be possible to compute 700 or 800 digits, but there is no way we can reach our target value.
- **Improved spigot algorithm**. We don't have to keep any arrays, but we have similar concerns as for Machin-like formulae – due to the computations we work with several big numbers. I was not lazy and checked how big they were – just for 255 digits of π, the algorithm requires 7KiB of RAM.
- **Bailey–Borwein–Plouffe formula**. An interesting idea, but it computes π digits in base 16 or base 2. We need base 10.
- **Plouffe's algorithm**. The original algorithm is not very practical (too slow).
- **Fabrice Bellard's algorithm based on the same formula for π**. Simple and straightforward. Not the fastest, but much better than Simon Plouffe's version.
- **Xavier Gourdon's algorithm**. It's even faster than Bellard's idea, but it's more complex and requires more code to be written, which is problematic with our poor ISA and limited program size (less than 8000 instructions).

From the list above, Fabrice Bellard's algorithm looks like an obvious choice, so let's research how we can write the program based on his short paper.

## Original formula

A very interesting formula has been used as a basis for Plouffe's algorithm:

\pi+3=\sum_{k=1}^{+\infty}\frac{k\,2^{k}}{\binom{2k}{k}}

You can take a look at a proof (and other interesting equations) in Lehmer's article.

The expression in round brackets is a central binomial coefficient and can be written as:

\binom{2n}{n}=\frac{2^n(2n-1)!!}{n!},\quad n!!=n \cdot (n-2) \cdots 3 \cdot 1

So the final numerical series to work with would be:

\pi+3=\sum_{k=1}^{+\infty}\frac{k\,2^{k}}{\binom{2k}{k}}=\sum_{k=1}^{+\infty}\frac{k \cdot k!}{(2k-1)!!}

## Computation of the nth digit for a fraction

Let's start with a simplified problem. We have the fraction s = \frac{b}{A}, and we want to find its nth digit. How would we do that?

First, we express A as a prime factorization:

A=\prod_{i=1}^{m}a_{i},\quad i \neq j \Rightarrow \gcd(a_i,a_j)=1

Then we can declare a few variables:

b_i = b \bmod a_i;\quad A_i = \frac{A}{a_i};\quad A_i^{-1} = \frac{1}{A_i} \bmod a_i

By applying the Chinese remainder theorem, we get:

b' = \left( \sum_{i=1}^{m} b_i A_i A_i^{-1} \right) \bmod{A}

Introduce another variable f_i = b_i \cdot A_i^{-1} \bmod a_i:

b' = \left( \sum_{i=1}^{m} f_i A_i \right) \bmod{A}

Because b' = b \bmod A, we can use it for the fractional part of s:

\{s\} = \frac{b'}{A} = \frac{\left( \sum_{i=1}^{m} \frac{f_i}{a_i} A \right) \bmod{A}}{A} \stackrel{(1)}{=} \left\lbrace \frac{\sum_{i=1}^{m} \frac{f_i}{a_i} A}{A} \right\rbrace = \left\lbrace \sum_{i=1}^{m} \frac{f_i}{a_i} \right\rbrace

It should be obvious how (1) works, but for completeness, a short proof is here:

\frac{x \bmod N}{N} = \frac{x - N \lfloor \frac{x}{N} \rfloor}{N} = \frac{x}{N} - \left\lfloor \frac{x}{N} \right\rfloor = \left\{\frac{x}{N}\right\}

OK, now we can write the formula to find digits of s starting from position n:

D_n = \{s \cdot 10^n\} = \left\lbrace \sum_{i=1}^m \frac{f_i \cdot 10^n}{a_i} \right\rbrace \stackrel{(2)}{=} \left\lbrace \sum_{i=1}^m \frac{f_i \cdot 10^n \bmod{a_i}}{a_i} \right\rbrace

Transition (2) is here just to simplify computations: we need only the fractional part, so to avoid big numbers (and 10^n can be very big) we can take the modulus by the denominator for each fractional addend.

As you can notice, this formula is not recurrent: each term in the sum is independent and can be computed separately, so we need to store just the current intermediate result. And thanks to modular arithmetic, the size of the variables is relatively small, hence a couple of dozen bytes is enough to keep all the data.
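This idea can be sketched in JavaScript (helper names are hypothetical, not the article's real routines; BigInt keeps the modular arithmetic exact):

```javascript
// Sketch: fractional digits of s = b/A starting from position n,
// where A is a product of distinct primes a_i.
const modpow = (base, exp, m) => {
  let result = 1n, b = base % m, e = exp;
  while (e > 0n) {
    if (e & 1n) result = (result * b) % m;
    b = (b * b) % m;
    e >>= 1n;
  }
  return result;
};

// Modular inverse via Fermat's little theorem (p is prime).
const modinv = (x, p) => modpow(x, p - 2n, p);

// First `count` fractional digits of b/A starting from position n.
const digitsOfFraction = (b, primes, n, count = 9) => {
  const A = primes.reduce((acc, p) => acc * p, 1n);
  let frac = 0; // a double accumulator is enough for a sketch
  for (const a of primes) {
    const Ai = A / a;                               // A_i
    const fi = ((b % a) * modinv(Ai % a, a)) % a;   // f_i = b_i * A_i^{-1} mod a_i
    const term = (fi * modpow(10n, BigInt(n), a)) % a;
    frac = (frac + Number(term) / Number(a)) % 1;   // { sum (f_i*10^n mod a_i) / a_i }
  }
  return Math.trunc(frac * 10 ** count).toString().padStart(count, '0');
};
```

For example, `digitsOfFraction(1n, [3n, 7n, 11n, 13n, 17n], 0)` yields the first nine fractional digits of 1/51051 without ever working with a number larger than a few hundred.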

## Computation of the nth digit of a numerical series

But the formula for π is not a regular fraction – it's represented as a numerical series, so we need to adapt the math manipulations above to another form: the partial sum of a positive finite series. We have one more complication: A_k (the denominator of a series term) is factorized into prime factors whose multiplicity can be greater than 1.

Let's declare a set of prime factors (a_1, a_2, …, a_m) that contains all possible factors of A_k; for some (k, i) the multiplicity of a_i in the factorization of A_k can be 0 (that particular number from the set is not a factor of that particular A_k). Also, we introduce v_i^{max} – the maximal multiplicity of a_i among all factorizations of A_k.

S_N=\sum_{k=1}^N\frac{b_k}{A_k},\quad A_k=\prod_{i=1}^{m}a_i^{v_{i,k}},\quad 0 \leq v_{i,k} \leq v_i^{max},\quad i \neq j \Rightarrow \gcd(a_i,a_j)=1,\quad v_{i,k} > 0 \Rightarrow a_i \nmid b_k

Declare variables, similar to the previous chapter:

A_{i,k} = \frac{A_k}{a_i^{v_{i,k}}}

A_{i,k}^{-1} = \frac{1}{A_{i,k}} \bmod a_i^{v_{i,k}}

f_{i,k} =
\begin{cases}
0 & \text{if } v_{i,k}=0 \\
b_k \cdot A_{i,k}^{-1} \bmod a_i^{v_{i,k}} & \text{if } v_{i,k} > 0
\end{cases}

You can notice that we're using b_k instead of b_k \bmod a_i^{v_{i,k}}, which are not equal even under modular multiplication. But we can do that because in the end we need just the fractional part of a number, so we can drop any integer parts early and work with numbers modulo a_i^{v_{i,k}}.

By repeating the logic from the previous section with the Chinese remainder theorem, we get the expression for a term of the series and the whole partial sum:

\frac{b_k}{A_k} = \left\lbrace \sum_{i=1}^{m} \frac{f_{i,k}}{a_i^{v_{i,k}}} \right\rbrace

S_N = \left\lbrace \sum_{k=1}^N \sum_{i=1}^{m} \frac{f_{i,k}}{a_i^{v_{i,k}}} \right\rbrace = \left\lbrace \sum_{i=1}^{m} \sum_{k=1}^N \frac{f_{i,k}}{a_i^{v_{i,k}}} \right\rbrace

The next step is to use the common divisor a_i^{v_i^{max}}, which is independent from k:

f'_{i,k} = f_{i,k} \cdot a_i^{v_i^{max}-v_{i,k}}

S_N = \left\{ \sum_{i=1}^{m} \sum_{k=1}^N \frac{f'_{i,k}}{a_i^{v_{i,k}} \cdot a_i^{v_i^{max}-v_{i,k}}} \right\} = \left\{ \sum_{i=1}^{m} \sum_{k=1}^N \frac{f'_{i,k}}{a_i^{v_i^{max}}} \right\}

S_N = \left\{ \sum_{i=1}^m \frac{f_i}{a_i^{v_i^{max}}}\right\},\quad f_i=\left( \sum_{k=1}^N f'_{i,k} \right) \bmod a_i^{v_i^{max}}

The same trick again – because the integer part is not important, we can use numbers from the residue class modulo a_i^{v_i^{max}}, which are not that large.

Now an important moment – we can deduce that every f'_{i,k} also belongs to the same residue class modulo a_i^{v_i^{max}}, even if it's not that obvious. We can prove it with a simplified example:

(x \bmod n^p) \cdot n^{q-p} = (x - n^p \cdot \lfloor \frac{x}{n^p} \rfloor) \cdot n^{q-p} = x \cdot n^{q-p} - n^q \cdot \lfloor \frac{x}{n^p} \rfloor

(x \cdot n^{q-p}) \bmod n^q = x \cdot n^{q-p} - n^q \cdot \lfloor \frac{x \cdot n^{q-p}}{n^q} \rfloor = x \cdot n^{q-p} - n^q \cdot \lfloor \frac{x}{n^p} \rfloor

Based on that, we can write f_i as (by expanding f'_{i,k} and applying the reasoning above):

f'_{i,k} = \left( b_k \cdot A_{i,k}^{-1} \bmod a_i^{v_{i,k}} \right) \cdot a_i^{v_i^{max}-v_{i,k}} = b_k \cdot A_{i,k}^{-1} \cdot a_i^{v_i^{max}-v_{i,k}} \bmod a_i^{v_i^{max}}

f_i=\sum_{k=1}^N b_k \left(\frac{A_k}{a_i^{v_{i,k}}}\right)^{-1} a_i^{v_i^{max}-v_{i,k}} \bmod a_i^{v_i^{max}}

And finally, we can derive a formula for the digits of the partial sum, starting from position n:

D_n=\left\{ \sum_{i=1}^m\frac{f_i \cdot 10^n \bmod a_i^{v_i^{max}}}{a_i^{v_i^{max}}} \right\}

## Determining the maximal multiplicity of a prime number in A_k

We can now plug values into the formula for the digits of the partial sum (you can notice that the k factor has been extracted from the b_k calculation – it's done to make the algorithm's computation easier):

b_k = k!

A_k = (2k-1)!!

v_{i,k} = v_{a_i}(A_k)

u_{i,k} = v_{a_i}(b_k)

f_i=\sum_{k=1}^N k \cdot \left( \frac{b_k}{a_i^{u_{i,k}}} \right) \cdot \left( \frac{A_k}{a_i^{v_{i,k}}} \right)^{-1} \cdot a_i^{v_i^{max}-v_{i,k}+u_{i,k}} \bmod a_i^{v_i^{max}}

The interesting question is how to find the value of v_i^{max} for a particular a_i. The problem definition from the previous section declares that a_i does not divide b_k, but that's not true in our case, because k! can easily be divisible by a_i. Fortunately, it affects only the v_i^{max} value. So to determine it, we need to find:

v_{a_i}\left(\frac{A_N}{b_N} \right) = v_{a_i}\left(\frac{(2N - 1)!!}{N!}\right)

v_{a_i} is the multiplicity function. We can expand the double factorial and use Legendre's formula to calculate it:

v_{a_i}\left(\frac{(2N - 1)!!}{N!}\right)

= v_{a_i}\left( \frac{(2N)!}{2^N \cdot N! \cdot N!} \right)

= v_{a_i}\left( \frac{(2N)!}{N! \cdot N!} \right)

= \sum_{t=1}^L \left\lfloor \frac{2N}{a_i^t} \right\rfloor - 2\sum_{t=1}^L \left\lfloor \frac{N}{a_i^t} \right\rfloor

= \sum_{t=1}^L \left( \left\lfloor \frac{2N}{a_i^t} \right\rfloor - 2 \left\lfloor \frac{N}{a_i^t} \right\rfloor \right)

We can drop 2^N because it is divisible only by 2, while a_i is odd. And if we look at the terms of that sum, we can see that each particular term is 0 or 1:

\left\lfloor \frac{2N}{a_i^t} \right\rfloor - 2 \left\lfloor \frac{N}{a_i^t} \right\rfloor

= \left\lfloor \frac{N}{a_i^t} \right\rfloor + \left\lfloor \frac{N}{a_i^t} + \frac{1}{2} \right\rfloor - 2 \left\lfloor \frac{N}{a_i^t} \right\rfloor

= \left\lfloor \frac{N}{a_i^t} + \frac{1}{2} \right\rfloor - \left\lfloor \frac{N}{a_i^t} \right\rfloor

\leq 1

Because we have L = \lfloor \log_{a_i}{2N} \rfloor terms, it follows that v_i^{max} \leq \lfloor \log_{a_i}{2N} \rfloor.
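This bound is easy to verify numerically. Here is a sketch (helper names are mine, not the article's real routines) that computes the multiplicity via the Legendre sum above:

```javascript
// v_a(k): multiplicity of prime a in k.
const multiplicity = (a, k) => {
  let v = 0;
  while (k % a === 0) { k /= a; v++; }
  return v;
};

// v_a((2N)! / (N! * N!)) via Legendre's formula: each term of the sum is
// floor(2N/a^t) - 2*floor(N/a^t), which is 0 or 1 as shown above.
const centralBinomMultiplicity = (a, N) => {
  let v = 0;
  for (let at = a; at <= 2 * N; at *= a) {
    v += Math.floor((2 * N) / at) - 2 * Math.floor(N / at);
  }
  return v;
};
```

For instance, with a = 3 and N = 20 the sum has three terms and the result never exceeds ⌊log₃ 40⌋ = 3.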

## Finding the upper limit for the partial sum of the π series

The next question is what particular N we should use for the partial sum to get the desired fractional digits. We can formalize our demand:

\frac{b_M}{A_M} \cdot 10^n < 1

Because we pick only the fractional part, we need to consider the terms of the series that are lower than 1 when multiplied by 10^n. Let's try to solve that inequality. We'll use another formula for the central binomial coefficient (it's easy to derive from the original statement):

\binom{2n}{n}=\frac{4^n(2n-1)!!}{(2n)!!}

And based on that, we can transform our π series term:

\frac{M \cdot 2^M}{\binom{2M}{M}} \cdot 10^n

= \frac{M \cdot 2^M}{\left( \frac{4^M \cdot (2M - 1)!!}{(2M)!!} \right)} \cdot 10^n

= \frac{M \cdot (2M)!!}{2^M \cdot (2M - 1)!!} \cdot 10^n

We can bound part of the fraction:

1 < \frac{(2M)!!}{(2M - 1)!!} \leq M

If you recall the definition of the x!! operator, it's easy to see why that part is at most M, with a few exceptions (when M is less than 4). Soon enough we'll switch to logarithms, so let's use powers:

1 < r \leq 2

\frac{M^{r} \cdot 10^n}{2^M} < 1

Now we can calculate an estimate of M:

\frac{M^r \cdot 10^n}{2^M} < 1

\Rightarrow M^r \cdot 10^n < 2^M

\Rightarrow \log_{2}{M^r} + \log_{2}{10^n} < M

\Rightarrow r \cdot \log_{2}{M} + n \cdot \log_{2}{10} < M

It's possible to solve it, but we don't need to, because we need some epsilon anyway – to retrieve more than 1 extra digit and to make sure that the fractional part has the required precision. So we can just add r \cdot \log_{2}{M} to that epsilon. For our requirements (n \leq 2048), that addition is a small number, and the total epsilon could be as small as 50. I picked 70 to have some spare room for imprecision in calculations with float numbers.
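As a sketch, the resulting bound collapses to a one-liner (the helper name is mine; the reference implementation below uses a smaller epsilon of 20):

```javascript
// Number of series terms needed so that terms beyond N don't affect the
// first n digits; epsilon absorbs the r*log2(M) part plus float imprecision.
const termsNeeded = (n, epsilon = 70) =>
  Math.trunc((n + epsilon) * Math.log2(10));
```

With epsilon = 20, `termsNeeded(2043, 20)` gives 6853, matching the "N is lower than 7000" estimate used later for sizing the variables.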

## Algorithm

Just implement two loops: an outer loop over prime numbers and an inner loop that calculates f_i for a particular a_i. There is an important optimization – instead of computing factorials on each iteration, we use recurrent formulae: A_k = (A_{k-1} \cdot (2k - 1)) \bmod m and b_k = (b_{k-1} \cdot k) \bmod m. Also, we track how many times we divide A_k and b_k by a_i to know the values of v_{i,k} and u_{i,k} that are used for exponentiation of a_i.

\begin{align*}
&\begin{aligned}
&N&&=\lfloor(n+\epsilon) \cdot \log_{2}{10}\rfloor \\
&D_n&&=0
\end{aligned} \\
&for\:(a \in \mathbb{P} \mid 2 < a < 2N)\:\{ \\
&\quad\begin{aligned}
&vmax&&=\lfloor \log_{a}{2N} \rfloor \\
&m&&=a^{vmax} \\
&v&&=0 \\
&u&&=0 \\
&f&&=0 \\
&A&&=1 \\
&b&&=1
\end{aligned} \\
&\quad for\:(k=2..N)\:\{ \\
&\quad\quad\begin{aligned}
&b &&= \left( \frac{k}{a^{v(a,k)}} \cdot b \right) \bmod m \\
&A &&= \left( \frac{2k-1}{a^{v(a,2k-1)}} \cdot A \right) \bmod m \\
&u &&= u + v(a, k) \\
&v &&= v + v(a, 2k-1)
\end{aligned} \\
&\quad\quad if\:(v - u > 0)\:\{ \\
&\quad\quad\quad f = (f + k \cdot b \cdot A^{-1} \cdot a^{vmax-v+u}) \bmod m \\
&\quad\quad \} \\
&\quad \} \\
&\quad\begin{aligned}
&d&&=(f \cdot 10^n) \bmod m \\
&D_n&&=\left\{ D_n + \frac{d}{m} \right\}
\end{aligned} \\
&\}
\end{align*}

I'm using JavaScript for the reference implementation because it's easier to translate working HLL code to assembly than to write it from scratch:

```javascript
const getNinePiDigits = (n) => {
  const N = Math.trunc((n + 20) * Math.log(10) / Math.log(2));
  let digits = 0;
  for (const a of primes.filter((prime) => prime > 2 && prime < 2 * N)) {
    const vmax = Math.trunc(Math.log(2 * N) / Math.log(a));
    const m = a ** vmax;
    let f = 0, A = 1, b = 1, v = 0, u = 0;
    for (let k = 2; k <= N; k++) {
      b = (b * (k / (a ** multiplicity(a, k)))) % m;
      A = (A * ((2 * k - 1) / (a ** multiplicity(a, 2 * k - 1)))) % m;
      u += multiplicity(a, k);
      v += multiplicity(a, 2 * k - 1);
      if (v - u > 0) {
        f = (f + (k * b * modularInverse(A, m) * (a ** (vmax - v + u)))) % m;
      }
    }
    const d = (f * modularPower(10, n, m)) % m;
    digits = (digits + (d / m)) % 1;
  }
  return Math.trunc(digits * 1e9).toString().padStart(9, '0');
};
```

A simple compiler from assembly to bytecode was enough for the relatively simple program from my previous article. But at some point, developing code became more and more tedious. So I had to improve my dev experience and extend the intel 40xx tool stack with a few more tools.

## Pre-processor

I wanted to have two features:

- the ability to split source code into sub-modules
- variable substitutions (a-la macros)

To support that, I've added special directives and a pre-processor that searches just for these directives and outputs a single file of source code that can be consumed by the compiler (and pasted into the emulator UI).

Here is an example of a module that implements some function (`foo.i4040`):

```
%define FOO_PARAM1_VAL1 0x5
%define FOO_PARAM1_VAL2 0xA
%define FOO_PARAM2_VAL1 D

foo:
  ...
  BBL 0
```

And a module that uses function `foo`:

```
%include "foo.i4040"

bar:
  FIM r0, $FOO_PARAM1_VAL1 . $FOO_PARAM2_VAL1
  JMS foo
  FIM r0, $FOO_PARAM1_VAL2 . 0
  JMS foo
  BBL 0
```

Macros were very helpful for memory organization. I've re-arranged the memory layout several times and changed variable placements in memory (for example, moved the variable that keeps N from memory register #5 to register #9). Without the pre-processor I would have to search for numerical values and check whether each one is related to a memory operation or is just a regular numeric literal.

## Compiler and linker

The first version of the compiler was very simple – it sequentially picks an instruction, encodes it to bytecode, and appends it to the ROM image. When the whole bytecode is formed, the compiler updates references for jump instructions if labels were used (here and later, jump instructions include the call instruction). But the Intel 40xx CPUs have a few peculiarities in their bytecode format:

- Some jump instructions are short jumps. It means that they can only jump to locations within the same ROM page (0x100 bytes) as the instruction itself. For example, it's possible to use such instructions for a jump from `0x2AA` to `0x255`, but not to `0x655`.
- Short jumps have another specialty – they shouldn't be placed at the last bytes of a ROM page. If a jump instruction is located at address `0x7FF` or `0x7FE` and tries to perform a short jump to `0x7BB`, the actual jump would be performed to `0x8BB` because of how the CPU works.

With the initial compiler, I solved these problems either by manually arranging routines or by padding problematic functions with `NOP` instructions to fix misalignments. It worked while the code size was not that large, but the code bloated, and doing all these refinements became a nightmare.

Other requirements for better multi-stage compilation were:

- The Intel 4040 supports two ROM banks, and for large programs, we want to use both.
- The need to be able to put particular routines at specific locations.

Let's spend some time and dig into the details of these two features required from the compiler.

As I've mentioned before, it's not that simple to jump from one ROM bank to another. You have to do something like this:

```
__location(0x0:0x00)
entrypoint:
  JMS output
  DB1
  JMS foo
  HLT

output:
  LDM 0xF
  WMP
  BBL 0

__rom_bank(0x1)
foo:
  JMS output
  DB0
  NOP
  BBL 0
```

Here we can see:

- `entrypoint` is placed in bank #0, at location `0x000`; it's the very first instruction executed by the CPU after RESET.
- `foo` is placed in bank #1 at an unspecified location; the linker chooses the best place for that block.
- The `output` routine is shared between banks. The linker has to duplicate that routine in both banks.
- A `NOP` instruction before `BBL`. The `DBx` instruction switches the bank at the third cycle, so we need to spend an extra cycle before the actual flow-control instruction.

We want to be able to specify a location not just for the entry point but also for switch tables. The Intel 40xx ISA has the `JIN rX` instruction, which jumps to the address contained in a register pair. It allows executing different code blocks based on a register value without needing many comparisons. I've used that trick in several places; here is an example from an optimized 4-bit by 4-bit division, where we have specific routines to divide operands by 1, 2, 3, …:

```
__location(00:0x10)
div4x4_1:
  XCH rr12
  LDM 0x0
  XCH rr13
  BBL 0

# INPUT:
#   acc  - dividend
#   rr10 - divisor
#   rr11 - 0x0
# OUTPUT:
#   rr12 - quotient
#   rr13 - remainder
__location(00:0x18)
div4x4:
  JIN r5

__location(00:0x20)
div4x4_2:
  RAR
  XCH rr12
  TCC
  XCH rr13
  BBL 0

...
```

The compilation pipeline from source code to ROM image has 4 stages:

- Preprocessing. I described that stage in the previous section.
- Parsing. The source code is parsed with a simple grammar into a list of encoded instructions with metadata (like the type of instruction or the source code line number) and a list of references: if an instruction refers to the location of another instruction (a jump by label, for example), we need to know that to update the reference in the final image.
- Uniting instructions into code blocks. These blocks are not always whole functions; they can be smaller pieces of code. Even if instructions are semantically linked (implement the same function), they can be spread across the ROM and don't have to be placed nearby. So a block is the smallest set of instructions that is executed sequentially, without a guaranteed jump (program counter change).

- Linking. Now we have a list of blocks and references between blocks. Based on that, we form superblocks – sets of blocks that have short jumps to each other. They have to be considered together, because there are extra limitations on their placement, while blocks connected only by long references are untangled and can be placed at any relative locations. Unfortunately, we can't apply an algorithm that solves the classical bin packing problem, because superblocks can cross ROM page boundaries despite containing short jumps.

That's why I chose a simple and naive approach – there is a cursor that points to a ROM location and moves forward to the end of ROM. On each iteration, I try to insert some superblock at the cursor (and advance the cursor by the superblock size), and if no superblock fits, I move the cursor 1 byte forward. A few notes:

- Blocks with fixed locations can be placed immediately, without any extra analysis.
- Try to fit the largest superblock first; usually, it's harder to fit large routines.
- There can be gaps between fixed blocks, so even though the ROM cursor moves forward only, we can try to fill the gaps with small blocks before running the main arrangement loop.

The order of blocks inside a superblock can also vary and affects the possibility of placement. So we need to check all permutations of blocks inside the superblock.

The number of permutations is n!, and a superblock may contain 10+ blocks, hence checking all variants is a heavy operation that can take minutes. That's why I've added a compilation cache – we know the best place and permutation for a superblock; if it has not been changed, we can reuse that saved information.
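To give a feel for the combinatorics, a minimal permutation generator (names assumed, not the real linker code) shows how quickly the candidate count grows:

```javascript
// All orderings of blocks inside a superblock; for n blocks this is n!
// candidates, each of which would still need a page-boundary check.
const permutations = (items) => {
  if (items.length <= 1) return [items];
  return items.flatMap((x, i) =>
    permutations([...items.slice(0, i), ...items.slice(i + 1)])
      .map((rest) => [x, ...rest])
  );
};
```

Three blocks give 6 orderings; 10 blocks already give 3,628,800, which is why caching a known-good placement pays off.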

## Emulator and profiler

I need a way to debug the program and its parts, and it's much easier to do with proper tools. I already had a simple Intel 4004 emulator, but I had to extend it a bit:

- It's essential to be able to run the emulator outside the browser. I wrote a bunch of tests to validate that even basic functions work correctly.
- Added support for Intel 4040 features: new instructions, the extended register bank, multiple ROM banks.
- Some tests require pre-defined RAM snapshots, so we need to be able to pass some initial RAM state instead of clean RAM banks.
- Another important feature was profiling. I wanted to know which functions consume the majority of the time. With emulation, it's possible to have instruction-level profiles, but function scope was good enough for me.
- The GUI started to support source maps. We want to be able to determine the source code line based on the ROM address of the currently executed instruction. Because the ROM layout is not that flat anymore, the linker has to produce a source map.
- I've also added watchers to the GUI – you can declare variables for debug purposes and watch how they change live. For instance, we can have a 2-digit number, with the lower digit in rr5 and the high digit placed in status character #0 of memory register #1. Even though it's possible to find out the value of that number with the register and RAM panels, it would not be very easy. So you can specify the watcher `[71]:s0,rr5` and see the value of that number while debugging the program.

To be honest, I underestimated the computational complexity of the task. So my first implementation was relatively dumb and had almost no optimizations. But there are still a few points to mention.

## Preparation

#### Number limits

At first, I determined the bit depth of the numbers. The highest n is 2043, then N is lower than 7000, and the modulus m is no more than 14000. It means that for the majority of operations 14 bits (2^14 gives a high boundary of 16384) are enough. But of course, I want to work with CPU words, so our target bit-width is 16 bits for almost all variables. That's good because it fits into the status characters of memory registers.

Let me remind you of the structure of RAM in Intel 40xx systems. We have up to 8 RAM banks. Each RAM bank consists of 16 memory registers. Each memory register is split into two parts: 16 main 4-bit characters and 4 status 4-bit characters. An important difference is how memory access is organized for the different kinds of characters. Generally, access to main characters is slower:

```
FIM r0, 0xF0   # loads 0xF0 to r0
SRC r0         # selects register F and main character 0
RDM            # loads the selected main character to the accumulator
XCH rr2        # rr2 = accumulator
FIM r0, 0xF1   # loads 0xF1 to r0
SRC r0         # selects register F and main character 1
RDM            # loads the selected main character to the accumulator
ADD rr2        # accumulator = main character 0 + main character 1
```

Status characters can be accessed with a special set of instructions, without the need to select a new character every time:

```
FIM r0, 0xF0   # loads 0xF0 to r0
SRC r0         # selects register F and main character 0
RD0            # loads status character 0 to the accumulator
XCH rr2        # rr2 = accumulator
RD1            # loads status character 1 to the accumulator
ADD rr2        # accumulator = status character 0 + status character 1
```

That's why storing variables in status characters is better than storing them in main characters.

OK, now I know the bit-depth and can make other estimations. Prime numbers can be computed once, using the sieve of Eratosthenes. A rough estimate of the number of primes below 2N is 2000. How are we going to store them? We can use a bitmap: if the bit at position x in RAM is set to 1, it means that x is prime. We have 10240 bits in RAM, while we need 14000 (the high boundary for 2N). But we don't care about even numbers – they're never prime, except 2, which is out of our interest. Hence, we need just 7000 bits.
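The odd-only bitmap idea can be sketched as follows (helper names are mine; one array entry per flag for readability, where the real program packs the flags into RAM bits):

```javascript
// Odd-only sieve of Eratosthenes: entry x covers the odd number 2x + 1,
// so a limit of 14000 needs only 7000 flags.
const sieveOddBitmap = (limit) => {
  const bits = new Uint8Array((limit >> 1) + 1).fill(1); // index x -> 2x + 1
  bits[0] = 0; // 1 is not prime
  for (let p = 3; p * p <= limit; p += 2) {
    if (!bits[(p - 1) >> 1]) continue;
    // p*p is odd and stepping by 2p keeps the multiples odd.
    for (let q = p * p; q <= limit; q += 2 * p) bits[(q - 1) >> 1] = 0;
  }
  return bits;
};

const isPrime = (bits, x) =>
  x === 2 || (x > 2 && x % 2 === 1 && bits[(x - 1) >> 1] === 1);
```

The algorithm only ever asks about odd primes greater than 2, so the special-cased `2` here is just for completeness.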

#### Logarithms and float numbers

What are the other tricky parts? Logarithms… I need to know \log_{2}{10} for the N computation. Even though I can calculate it just once, taking a logarithm is still a very hard operation for a CPU without float numbers and native division. Sure, I could just use a pre-calculated exact constant, but that would be against the rules. Instead, I use an approximation: \log_2{10}\approx\frac{93}{28}. The precision should be enough. We aren't afraid of overshooting N – it would just add more work, but wouldn't affect correctness.
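The quality of that rational approximation is easy to check (the helper name is an assumption, not the article's code):

```javascript
// Rational stand-in for log2(10): 93/28 ≈ 3.32143 vs log2(10) ≈ 3.32193,
// an error below 5e-4, so N = trunc((n + epsilon) * 93 / 28) lands within
// a couple of terms of the exact value and epsilon covers the difference.
const approxN = (n, epsilon = 70) =>
  Math.trunc(((n + epsilon) * 93) / 28);
```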

Another logarithm to evaluate is \log_{a}{2N}. But, fortunately, we need just the integer part, so a simple loop finds v^{max} = \max \{v: a^v < 2N\}.

As I've mentioned, we don't have float numbers, but D is the fractional part of a float number. I solved it by keeping that number in simple fixed decimal point form – just multiply every d by 10^{15} to have 15 fractional digits. 9 digits are our result and 6 digits preserve precision to avoid cumulative rounding errors.
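A sketch of that fixed-point representation (BigInt stands in for the long decimal arithmetic the 4040 code does by hand; names are assumed):

```javascript
// D stored as an integer meaning D / 10^15: 9 result digits plus 6 guard
// digits that absorb cumulative truncation errors.
const SCALE = 10n ** 15n;

// { D + d/m } in fixed point; BigInt division truncates, like the long
// division on the 4040 would.
const addTerm = (D, d, m) => (D + (d * SCALE) / m) % SCALE;

// Drop the 6 guard digits and keep the 9 result digits.
const nineDigits = (D) => (D / 10n ** 6n).toString().padStart(9, '0');
```

For example, accumulating 1/7 and then 9/10 leaves the fractional part 0.042857…, and `nineDigits` returns `"042857142"`.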

#### Modular arithmetic

Time to think about modular arithmetic. The simplest way would be to perform an operation (multiplication/exponentiation) and then divide by the modulus to get the remainder. But I decided to do it a bit smarter – by using the binary method. There is a multiplication property of modular arithmetic:

$$(a \cdot b) \bmod m = ((a \bmod m) \cdot (b \bmod m)) \bmod m$$

Knowing that, we can write the following recursive equation for $f(a, b) = (a \cdot b) \bmod m$:

$$f(a, b) = \begin{cases} 0 & \text{if } b = 0 \\ f(2a \bmod m,\ b / 2) & \text{if } b \text{ is even} \\ (a + f(a, b - 1)) \bmod m & \text{if } b \text{ is odd} \end{cases}$$

A similar idea works for modular exponentiation:

$$a^b \bmod m = \begin{cases} 1 & \text{if } b = 0 \\ (a^{b/2})^2 \bmod m & \text{if } b \text{ is even} \\ ((a^{(b-1)/2})^2 \cdot a) \bmod m & \text{if } b \text{ is odd} \end{cases}$$
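The cases above translate directly into code; a minimal JavaScript sketch of the binary (square-and-multiply) method:

```javascript
// Binary modular exponentiation, following the recursive cases above.
// Assumes the intermediate products fit into a JS number.
const modPow = (a, b, m) => {
  if (b === 0) return 1;
  if (b % 2 === 0) {
    const half = modPow(a, b / 2, m);
    return (half * half) % m;
  }
  const half = modPow(a, (b - 1) / 2, m);
  return (half * half * a) % m;
};
```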

The last operation is modular inversion. Because we can be sure that A and m are co-prime, we can use Euler's theorem:

$$A^{\phi(m)} \equiv 1 \pmod m$$

$$A^{\phi(m)-1} \equiv A^{-1} \pmod m$$

By definition $m = a^{v^{max}}$, so we can calculate Euler's totient function easily:

$$\phi(m) = \phi(a^{v^{max}}) = a^{v^{max}} - a^{v^{max} - 1}$$
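Putting the two formulas together, the whole inversion fits in a few lines; a hypothetical sketch (the naive power loop stands in for the binary exponentiation described above, and `modInverseByEuler` is an illustrative name):

```javascript
// Modular inverse via Euler's theorem for a prime-power modulus m = a^vmax:
// phi(m) = a^vmax - a^(vmax - 1), and A^(phi(m) - 1) is congruent to A^(-1) mod m.
const modInverseByEuler = (A, a, vmax) => {
  const pow = (base, exp, mod) => {
    let r = 1;
    for (let i = 0; i < exp; i++) r = (r * base) % mod; // naive, for clarity
    return r;
  };
  const m = a ** vmax;
  const phi = a ** vmax - a ** (vmax - 1);
  return pow(A % m, phi - 1, m);
};
```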

## Results

How close would that code be to the 70-hour target? Very, very far. I was not able to finish the emulation, but after computing ~800 digits I used a polynomial fit to get an estimate. And it was a depressing **14.5 years**. Looks like I had started a long journey to improve that time.

## Algorithm improvements

### “Concurrent” computations

Before digging into low-level optimizations, I wanted to examine the algorithm more scrupulously. I had no huge expectations, because Fabrice Bellard has a brilliant mind, and the chances that he missed some obvious point were low. But he solved a different problem than mine: determining the *nth* digit of π, while I was looking for *n* digits of π. Maybe we can reuse some computation results between runs? An in-memory memoization cache could not be large: just around 300 entries if we use 16-bit numbers as cache keys and cache values. We need to do something smarter. The key observation is that N has no upper limit for a particular n. A larger N just adds more precision, but D would still be correct. So we can compute the common $f_i$ for all digit positions below n, with N computed for the highest required n. The only position-specific operation is $D = \left\{ D + \frac{(f_i \cdot 10^{digitPosition}) \bmod m}{m} \right\}$.

Because we would compute $f_i$ for the high N at some point anyway, the extra work is just unnecessary updates of D for low digit positions, where we don't need such precision. We keep intermediate values of D as fixed-precision numbers with up to 15 digits after the decimal point, and they need more space to be stored: 16 words (8 bytes) in total. Hence, in the best case, we can keep 160 such numbers, while we need 2052 / 9 = 228 entries. It's not a big deal, because we can split the computation into two intervals, (0, 1026) and (1026, 2052). Of course, it would be better to run that loop just once, but memory limitations hit us again.

A small change is required for the HLL implementation:

```javascript
for (let i = 0; i < digits.length; i++) {
  d = (f * modularPower(10, startingPosition + i * 9, m)) % m;
  digits[i] = (digits[i] + (d / m)) % 1;
}
```

After applying that idea to my assembly code, I was pleased with an **85x** improvement: **64 days** instead of years. But I still needed to make it 20x faster!

### Memory layout for prime numbers

We started storing 100+ 8-byte numbers in memory, and now we are running out of memory for our prime-number bitmap. Let's revisit it then.

Instead of a regular sieve, we can use a segmented sieve. The idea is to have an initial segment of prime numbers lower than $\sqrt{2N}$ (let's call it the segment size) and be able to generate the following segments based on that initial data. The segment size equals 118, so we know two facts about the initial prime segment: 1) each prime number fits in 1 byte (2 words), and 2) there are fewer than 30 primes below 118. Based on that, we need just 30 bytes for the initial segment.

The following prime segments can be organized in a very compact way as well. The difference between the first and last prime numbers within a segment is no more than the segment size, by definition. Hence, we can keep the base value of the segment as-is (a 16-bit number) and have a bitmap for 59 entries (again, even numbers don't need to be represented in our map). So even if we spend a whole word per bitmap value, it would take only 32 bytes in total.
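The core of the technique can be sketched in a few lines of JavaScript; this is a generic segmented-sieve illustration under assumed names, not the compact i4040 layout described above:

```javascript
// Segmented sieve: given the initial primes (all primes below the segment
// size), mark composites inside the window [base, base + segmentSize).
const sieveSegment = (initialPrimes, base, segmentSize) => {
  const isPrime = new Array(segmentSize).fill(true);
  for (const p of initialPrimes) {
    // first multiple of p inside the segment, but never below p*p
    let start = Math.ceil(base / p) * p;
    if (start < p * p) start = p * p;
    for (let c = start; c < base + segmentSize; c += p) isPrime[c - base] = false;
  }
  return isPrime; // isPrime[i] corresponds to the number base + i
};
```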

### Fixed-precision number format

We don't keep D in BCD format (in that format `1234` would be represented as `0x1234` in memory instead of `0x4D2`) because of memory limitations. So it makes little sense to have a decimal point rather than a hexadecimal one. Hence, instead of multiplying D by $10^{15}$, we can multiply it by $2^{50}$, which is just a bit-shift operation.

### Fast path for large primes

I was still skeptical about the possibility of achieving such a performance gain with code optimizations alone, so I continued to analyze the basis of the algorithm. And I quickly discovered that for primes bigger than $\sqrt{2N}$ we have $v^{max} = 1$. That allowed me to cut a lot of corners. v in that case has a binary nature: it is either 0 or 1, but more exciting is that v alternates between the two states in continuous runs. v switches to 1 when $a_i \mid 2k-1$ and then switches back to 0 when $a_i \mid k$. Here is a simple visualization of the alternation.

The basic HLL code would look like this:

```javascript
do {
  // iterate until the next k that satisfies (2 * k - 1) % a === 0
  for (; (2 * k - 1) % a !== 0; k++) {
    b = (b * k) % a;
    A = (A * (2 * k - 1)) % a;
  }
  // v becomes 1 here
  b = (b * k) % a;
  A = (A * ((2 * k - 1) / a)) % a;
  f = (f + modularInverse(A, a) * b * k) % a;
  k++;
  // iterate until the next k that satisfies k % a === 0
  for (; (k % a) !== 0; k++) {
    b = (b * k) % a;
    A = (A * (2 * k - 1)) % a;
    f = (f + modularInverse(A, a) * b * k) % a;
  }
  // now v becomes 0
  b = (b * (k / a)) % a;
  A = (A * (2 * k - 1)) % a;
  k++;
} while (k <= N);
```

We already excluded exponentiation of a and many unnecessary comparisons, but we can do better than that:

- We can get rid of division. The outer loop step is a, so `k / a` and `(2 * k - 1) / a` increment by 1 and 2 respectively on each outer loop iteration.
- We don't need exact values of k for computations; we can use $k \bmod a$ to get faster modular multiplication with smaller numbers.
- At the beginning of the outer loop we know that $k \bmod a = 1$, hence we have a fixed sequence of multiplications for b while v = 0. It can be calculated outside the loop: $b_{v=0} = (\frac{a - 1}{2} + 1)! \bmod a$. Starting from $k \bmod a = \frac{a-1}{2} + 1$ we switch v to 1.

The improved (in performance, not readability) version is:

```javascript
let multiplierForZeroV = factorial(((a - 1) / 2) + 1) % a;
let k = ((a - 1) / 2) + 1;
let reducedCoefA = 3;
do {
  // iterate until the next k that satisfies (2 * k - 1) % a === 0
  for (reducedCoefB = 2; reducedCoefB <= (a - 1) / 2; reducedCoefB++) {
    A = (A * reducedCoefA) % a, reducedCoefA += 2;
  }
  b = (b * multiplierForZeroV) % a;
  // v becomes 1 here
  A = (A * multiplierA) % a, multiplierA += 2;
  f = (f + modularInverse(A, a) * b * reducedCoefB) % a;
  k++, reducedCoefB++;
  // iterate until the next k that satisfies k % a === 0
  reducedCoefA = 2;
  for (; reducedCoefB < a; k++, reducedCoefB++) {
    A = (A * reducedCoefA) % a, reducedCoefA += 2;
    b = (b * reducedCoefB) % a;
    f = (f + modularInverse(A, a) * b * reducedCoefB) % a;
  }
  // check the loop boundary now to avoid a redundant first loop
  k += (a + 1) / 2;
  // now v becomes 0
  b = (b * multiplierB) % a, multiplierB++;
  A = (A * reducedCoefA) % a, reducedCoefA += 2;
  // try to keep the factor lower than "a"
  if (reducedCoefA > a) {
    // the factor is incremented by 2, so it can exceed "a" only by 1
    reducedCoefA = 1;
  } else {
    A = (A * reducedCoefA) % a;
    reducedCoefA += 2;
  }
} while (k <= N);
```

It was a good observation, but it gave me just tens of percent of performance gain, while I need thousands. Time to look into the basic operations.

### Profiling

Usually, when you have performance problems, you want to pinpoint bottlenecks and hot code. So I started with gathering information about typical execution profiles for the heavy computation routines. It was easy enough to produce stack traces compatible with the flamegraph tool.

Now I have a starting point. After each optimization, I re-evaluated the profile to determine the target functions that are worth checking.

## Modular inversion

From the first flamegraph, it was very clear that I needed to do something about modular inversion. Looks like a "fast" implementation based on Euler's theorem is not that fast. When I changed the algorithm to the regular division-based extended Euclidean, I got very promising numbers. But then I thought about the binary version of the same algorithm. The high-level code is simple:

```javascript
const modularInverse = (a, m) => {
  if (a === 1) {
    return 1;
  }
  let rx = a, ry = m;
  let v = 1, u = 0;
  while (rx % 2 === 0) {
    rx = rx >> 1;
    v = (v % 2 === 0) ? (v >> 1) : (v + m) >> 1;
  }
  while (ry > 1) {
    if (ry % 2 === 0) {
      ry = ry >> 1;
    } else {
      if (ry < rx) {
        [rx, ry] = [ry, rx];
        [v, u] = [u, v];
      }
      ry = (ry - rx) >> 1;
      u = (u - v) % m;
    }
    u = (u % 2 === 0) ? (u >> 1) : ((u + m) >> 1);
  }
  return u < 0 ? (u + m) : u;
};
```

The translation to assembly was also straightforward. Maybe the only interesting point was halving negative numbers (all other functions operate only on positives). Surprisingly, it was more efficient in i40xx assembly than I expected:

```
LD rr3    # load the highest word of the number
RAL       # put the sign bit into carry to be able to shift negative numbers
LD rr3    # load the highest word of the number again
RAR       # shift it right; because carry holds the sign bit,
          # it is propagated back into the highest bit of the word
XCH rr3   # save the updated word
...
```

I did not continue to develop a faster version, even though it was possible by using various tricks. Some of them would have been easy enough to implement even with the i40xx ISA.

## Modular exponentiation

We use that function in two places: to compute $10^x$ when updating D, and to compute $a^{v^{max} - v + u}$. Because we know that x increases by 9 on every iteration, we can avoid exponentiation with just 3 modular multiplications by 1000 (because our multiplication routine works with 16-bit numbers, we can't multiply by $10^9$ directly):

```javascript
let powered10 = modularPower(10, startingPosition, m);
for (let i = 0; i < digits.length; i++) {
  const d = (powered10 * f) % m;
  digits[i] = (digits[i] + (d / m)) % 1;
  powered10 = (powered10 * 1000 * 1000 * 1000) % m;
}
```

Exponentiation is also used in cases when $v^{max} > 1$. Even though we now know there are not many such cases, we can improve it as well. We can bound $v^{max} < \lfloor \log_{a_0}{2N} \rfloor$, hence the power value is no more than 7. Moreover, $a^{v^{max} - v + u} < a^{v^{max}}$, so we don't need *modular* exponentiation at all. And because there are at most 7 values of $a^t$, we can store them in a small cache table.

## Division

After I added the special flow for $v^{max} = 1$, the number of divisions decreased drastically, but I had done the division optimization before that change, so I left it in place.

### 4bit x 4bit division

The foundation of multi-word division is a 4bit x 4bit routine. Even naive code with a subtraction loop was not that slow:

```
div4x4:
  FIM r6, 0x00
div4x4_loop:
  SUB rr11
  JCN nc, div4x4_return
  CLC
  ISZ rr12, div4x4_loop
div4x4_return:
  ADD rr11
  XCH rr13
  CLC
  BBL 0
```

17 cycles on average across all combinations of dividend and divisor.
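For reference, here is a high-level model of what that subtraction loop does (an illustrative sketch, not the assembly itself):

```javascript
// Model of the div4x4 routine: repeatedly subtract the divisor,
// counting iterations; the count is the quotient, what is left is
// the remainder.
const div4x4 = (dividend, divisor) => {
  let quotient = 0;
  let remainder = dividend;
  while (remainder >= divisor) {
    remainder -= divisor;
    quotient++;
  }
  return { quotient, remainder };
};
```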

In the chapter about the compiler and linker, I showed an approach with 15 specialized code blocks for 15 divisors (divide by 0?!). Some blocks were not able to fit into 0x10 bytes (the gap between two addresses in our switch table), but thanks to conditional jumps inside them, it was possible to arrange all cases on the same ROM page. That optimization cut 6 cycles from each function call. Not much, but it's a 35% performance gain for that small routine.

### 4bit x 4bit multiplication

Another building block for division is 4bit x 4bit -> 8bit multiplication. A naive implementation of 4x4 division was fast, but we can't say the same about basic 4x4 multiplication. Adding the first factor in a loop counted by the second factor was not very efficient (~70 cycles on average). So it's worth spending memory on a multiplication table! We are going to use a similar trick with a switch table to get 16 code blocks for 16 multipliers. Most of them fetch data from memory, but we can implement the cases for the 0/1/2/8 multipliers with regular code to save RAM.

```
__location(01:0x00)
mul4x4_0:
  FIM r6, 0x00                           # low, high = 0
  BBL $BANK_WITH_VARIABLES

# INPUT:
#   acc  - first factor (a)
#   rr10 - second factor (b)
#   rr11 - 0x0
# OUTPUT:
#   rr12 - low word of product
#   rr13 - high word of product
__location(01:0x08)
mul4x4:
  JIN r5

__location(01:0x10)
mul4x4_1:
  XCH rr12                               # low = a
  LDM 0x0
  XCH rr13                               # high = 0
  BBL $BANK_WITH_VARIABLES

__location(01:0x20)
mul4x4_2:
  RAL
  XCH rr12
  TCC
  XCH rr13
  BBL $BANK_WITH_VARIABLES

__location(01:0x30)
mul4x4_3:
  XCH rr12
  LDM $BANK_WITH_MUL_TABLE_FOR_FACTOR_3
  DCL
  SRC r6
  RD2
  XCH rr12                               # low
  RD3
  XCH rr13                               # high
  BBL $BANK_WITH_VARIABLES
```

### 16bit x 16bit division

For multi-word division I used the well-known Knuth Algorithm D. But I could make it more specialized and have different code paths for combinations of dividend and divisor sizes. For instance, there is a case when both divisor and dividend are 3 words long, or the dividend is 2 words long but the divisor is just 1 word. We have just 10 combinations (the divisor is always shorter than the dividend), and only 3 of them require following Algorithm D: the 16x8, 16x12, and 12x8 divisions. In all other cases, we can cut corners and perform more efficient calculations, without speculating over estimated quotient digits.

## Modular multiplication

After all the improvements from the previous chapters, modular multiplication became the main computation abyss: it occupied more than 60% of execution time. And I spent a lot of time polishing that operation. I tried several approaches, and some optimizations were replaced by more efficient alternatives. I will describe the final form of that function.

### Algorithm

There is a temptation to use the Montgomery form to perform modular multiplications, especially because we could keep that form for our inner loop over k and convert back just before computing chunks of digits. But for Montgomery modular multiplication we would be doing 3 regular multiplications, and even with optimizations you would have to perform $\frac{13n^2}{8} + n$ single-word multiplications, where n is the number of words in a number (n = 4 in our case). So that means 30 4x4 multiplications, and each costs 12 cycles, which gives an estimate of 360 cycles for the multiplications alone; with 150 additions (figures from the same paper) it under-performs the current solution.

What about simple fast multiplication with modular reduction? It still requires at least 16 calls to the 4x4 multiplication routine just for the multiplication step. And the reduction is much more costly. Not a solution. That paper also uses the Karatsuba algorithm for multiplication, but only for big numbers: for small values, this algorithm is worse than just regular scanning multiplication.

So we are going to use just regular binary multiplication, but with a bunch of low-level hacks:

```javascript
const modularMultiply = (a, b, m) => {
  let multipliedFactor = a;
  let result = 0;
  for (let shiftedFactor = b; shiftedFactor > 0; shiftedFactor = shiftedFactor >> 1) {
    if ((shiftedFactor & 0x1) === 0x1) {
      result = (result + multipliedFactor) % m;
    }
    multipliedFactor = (multipliedFactor + multipliedFactor) % m;
  }
  return result;
};
```

### Lazy modular reduction

The idea is that `result` does not have to belong to the least-residue system modulo m before the final value is returned. It means that we don't need to do a modular reduction on every step. But we still have to perform that operation to prevent integer overflow (we operate with 16-bit numbers). Originally, modular reduction after each addition could be performed with a single subtraction, `a = a > m ? a - m : a`, because each addend was less than m. With the lazy approach, that's not true anymore, and we would have to do a proper modular reduction. But we can do it a bit smarter: overflow can potentially happen with 4-word numbers only, so we can create a lookup table with pre-computed multiples of m that can be subtracted from the reduced number to make it smaller while keeping it positive and congruent. Of course, that reduction is not exact, but it's good enough to prevent overflows. The LUT is very simple:

$$x = \overline{x_3 x_2 x_1 x_0}, \quad x_3 \rightarrow \lfloor \tfrac{x_3 \cdot 16^3}{m} \rfloor \cdot m$$

That correction gives us a rough value. But for a small modulus, it can still leave a large divergence. We don't want to create more LUTs (memory is not infinite), but we can keep one more number:

$$\lfloor \tfrac{F00_{16}}{m} \rfloor \cdot m$$

If the reduced number, even after subtracting the value from the LUT, is still bigger than `0x1000`, we can reduce it by that constant. It allows us to always have 3-word numbers after the rough reduction.
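A JavaScript model of that rough reduction (an illustrative sketch of the two-step LUT scheme described above, not the i4040 code):

```javascript
// Rough modular reduction via a 16-entry LUT indexed by the highest nibble:
// LUT[x3] = floor(x3 * 16^3 / m) * m, plus one extra constant
// floor(0xF00 / m) * m for the residual range above 0x1000.
const makeRoughReducer = (m) => {
  const lut = [];
  for (let x3 = 0; x3 < 16; x3++) lut.push(Math.floor((x3 * 0x1000) / m) * m);
  const extra = Math.floor(0xf00 / m) * m;
  return (x) => {
    let r = x - lut[(x >> 12) & 0xf]; // still congruent to x mod m
    if (r >= 0x1000) r -= extra;      // second, finer correction
    return r;
  };
};
```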

By this moment in the article, I have described all the caches and tables that are stored in RAM, so I can show you the final memory layout:

Coloring legend:

- Green section: values of $D_n$. 114 registers.
- Purple section: the LUT with modulus multiples.
- Red section: multiplication tables. As a reminder, we need just 2 words per table entry, so each RAM bank can hold tables for 2 multipliers. 6 banks = 12 multipliers; we excluded the tables for 0, 1, 2, and 8 because we compute the result of multiplication by those factors in code.
- Orange section: the table with pre-computed exponents for the current prime number. Can hold up to 8 4-digit entries.
- Yellow section: the primality map for the current segment; we use it to iterate through prime numbers.
- Blue section: the initial segment of prime numbers. Can hold up to 32 2-digit numbers.
- White areas: free to use for variables.

Quite a dense layout, but still better than in my previous work.

### Extending laziness to the whole computation routine

Then I thought about the moment when we actually need the exact value from the least-residue system modulo m. And it looks like we need it only at the very last step, before calculating $\frac{d}{m}$. All other computations can be performed with congruent numbers, because they are done in terms of modular arithmetic. It reminds me of the Montgomery form a bit: you convert input values to that form, do various operations, and convert back just before returning the final result.

Not having exact values does affect performance in a way: values are bigger than they could be, so there is more unnecessary logic to execute. But the overall benefit from avoiding proper modular reduction is much larger.

### Addition instead of subtraction

It may sound strange, but addition requires fewer instructions than subtraction, because of the inverted carry flag (borrow for subtraction). When you subtract one number from another, the borrow flag is set if there was *no borrow*. So when subtracting multi-word numbers, you need to invert the borrow flag with an extra `CMC` instruction to let the next `SUB` use the actual borrow information from the previous digit:

```
# compute [rr0, rr1] = [rr0, rr1] - [rr2, rr3]
LD rr0
SUB rr2   # acc = rr0 - rr2, borrow flag is 0 if there was a borrow
CMC       # invert the borrow flag
XCH rr0
LD rr1
SUB rr3   # acc = rr1 - rr3 - borrow, borrow flag is 0 if there was a borrow
CMC       # invert the borrow flag
XCH rr1
```

Addition doesn't require correcting the carry flag:

```
# compute [rr0, rr1] = [rr0, rr1] + [rr2, rr3]
LD rr0
ADD rr2   # acc = rr0 + rr2, carry flag is 1 if there was a carry
XCH rr0
LD rr1
ADD rr3   # acc = rr1 + rr3 + carry, carry flag is 1 if there was a carry
XCH rr1
```

So we can replace the subtraction of x with the addition of $16^4 - x$ (the two's complement of x) for 4-word numbers, and the result will be the same but with fewer instructions. It's a generic observation, but it's most profitable for modular multiplication, because of how many times this operation is performed.
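A tiny demonstration of the equivalence for 16-bit (four 4-bit word) values; the function names are illustrative:

```javascript
// For 16-bit values, x - y (mod 2^16) equals x + (2^16 - y) (mod 2^16),
// so a multi-word subtract can become an add of the precomputed two's
// complement, which on the 4040 avoids the per-word CMC to re-invert
// the borrow flag.
const sub16 = (x, y) => (x - y) & 0xffff;
const addComplement16 = (x, y) => (x + ((0x10000 - y) & 0xffff)) & 0xffff;
```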

### Switch tables again

We have no more than 16 loop iterations, so we can unroll the loop body and check each bit of the 2nd factor:

```
# assume that rr6 has b[0]
LDM 0x1
AN6
JCN z, mulMod_skipMultiplierBit0     # if ((b >> 0) & 0x1)
JMS mulMod_updateResult              # res = res + multipliedFactor
mulMod_skipMultiplierBit0:
JMS mulMod_updateMultipliedFactor    # multipliedFactor = multipliedFactor * 2
LDM 0x2
AN6
JCN z, mulMod_skipMultiplierBit1     # if ((b >> 1) & 0x1)
JMS mulMod_updateResult              # res = res + multipliedFactor
mulMod_skipMultiplierBit1:
JMS mulMod_updateMultipliedFactor    # multipliedFactor = multipliedFactor * 2
# ... 14 more checks for particular bits
```

But we can go further: for a specific 4-bit word, we know the exact sequence of operations. For instance, here are the `res` and `multipliedFactor` updates when the current digit of the factor is 5:

```
__location(0x2:0x50)
mulMod_processNibble_5:   # 0b0101
  JMS mulMod_updateResult
  JMS mulMod_updateMultipliedFactor
  JMS mulMod_updateMultipliedFactor
  JMS mulMod_updateResult
  JMS mulMod_updateMultipliedFactor
  JUN mulMod_updateMultipliedFactor
```

We can extend that idea and have another switch table for the last digit, because we don't need to maintain `multipliedFactor` after we process the highest non-zero bit:

```
__location(0x3:0x50)
mulMod_processLastNibble_5:   # 0b0101
  JMS mulMod_updateResult
  JMS mulMod_updateMultipliedFactor
  JMS mulMod_updateMultipliedFactor
  JUN mulMod_updateResultLast
```

### Performing 3 or 4 consecutive multiplied-factor updates as a single operation

The last optimization was trivial: we can compute `16 * multipliedFactor` and `8 * multipliedFactor` by using word or bit shifts. But we need to check that we would not overflow 16 bits. If there is such a risk, then we fall back to calling `mulMod_updateMultipliedFactor` several times.

### Merge subroutines

Often, after the temporary result update, we need to update the multiplied factor, and in the code we have snippets like:

```
JMS mulMod_updateResult
JMS mulMod_updateMultipliedFactor
```

The idea is to have a single routine `mulMod_updateResultAndFactor` that combines the code from `mulMod_updateResult` and `mulMod_updateMultipliedFactor`. But why is that better? The math is simple: we had 2 subroutine calls and 2 returns; with this change we have 1 subroutine call and 1 return. It saves 3 cycles for each such case. That may sound like a very marginal amount, but keep in mind that `mulMod` takes 60% of the whole execution time and is a relatively small function, so every non-executed instruction wins us several minutes.

### A faster check for potential overflow of the multiplied factor

As I mentioned before, we reduce the multiplied factor only if it could overflow our 16-bit variable. To check for potential overflow, very simple reasoning is used: because the multiplied factor is doubled, we can check that the highest nibble is always less than `0x8`:

```
XCH rr2                                     # multipliedFactor = multipliedFactor + multipliedFactor
AN7                                         # if the previous multipliedFactor[3] < 4, the new multipliedFactor[3] < 8
JCN z, mulMod_updateResultAndFactor_return  # if (multipliedFactor[3] < 0x8)
```

This is a small and efficient check: just 3 cycles (the `JCN` instruction takes 2 machine cycles to execute). But every cycle matters, and that's why I used another trick: replace the `AN7`/`JCN` pair with the single instruction `JIN`. We occupy another ROM page with a switch table. `JIN r0` jumps to the address encoded in `r0`, and if `rr0` holds the most significant digit of the multiplied factor, then we can split code execution into two paths: one for `rr0 < 0x8` and one for `rr0 >= 0x8`. Of course, it's totally inefficient in terms of ROM space, but we still had some room.

```
JIN r1

__location(0x4:0x00)
mulMod_updateMultipliedFactor_factor0:
  BBL 0

__location(0x4:0x10)
mulMod_updateMultipliedFactor_factor1:
  BBL 0

...

__location(0x4:0x80)
mulMod_updateMultipliedFactor_factor8:
  SRC r1
  RD0
  ADD rr0
  XCH rr0
  RD1
  ADD rr1
  XCH rr1
  RD2
  ADD rr4
  XCH rr4
  RD3
  ADD rr2
  XCH rr2
  CLC
  BBL 0

...
```

### Optimizations that were removed from the final code

There were a few working ideas that, for various reasons, are not present in the final code (mostly because other optimizations were better, and because of ROM space):

- A special code path for a 12-bit modulus. Because at some point I stopped keeping variables less than m, this separation by modulus size became useless.
- Rough comparison of numbers with the modulus. Before lazy reduction, we had to check whether a number was bigger than m before doing the subtraction. This comparison was costly (because it was a subtraction itself, and if the operand was lower than m, the operation was just a waste of time). To perform a rough comparison we could use another lookup table:

```
SRC rX          # the first register in the pair holds the address of the LUT, the second one multipliedFactor[3]
RDM             # read isLessThanModulus = LUT[multipliedFactor[3]]
JCN z, ret      # return if (isLessThanModulus)
```

But with lazy reduction, we no longer care whether the number is higher than the modulus.

- Another idea to avoid unnecessary subtractions was to keep the temporary result negative. We can easily detect the sign of the number, and if it becomes positive, just subtract the modulus. Before returning, always add m.
- The number of performed operations depends on the most significant set bit of the 2nd factor. So we want to use the lower factor as the shift factor, which requires swapping the operands if the 2nd factor is bigger than the 1st. It's a nice optimization, but the results were marginal: it was hard to find multiplications that could use these features.
- We could have a somewhat finer reduction even in the lazy version: add another lookup table for numbers less than `0x1000`. But it requires more RAM and helped a lot only for a small modulus.

## Results on the emulator

After all that jiggery-pokery, the whole program finished in 69h 29m 02s, and I even got the correct digits of π. Great success. Here is one of the latest flamegraphs.

Now it's time to run it on real hardware.

I used almost the same schematic as in my previous project. The main differences are:

- The stm32 chip is soldered in place with all the necessary auxiliary components, instead of using an stm32 board plugged into a DIP socket on the main board.
- A single USB-based power source instead of an external 15V one.
- The UART interface is replaced with USB.

Apart from that, the design is nearly identical: LM339 and CD40109 are used as level shifters.

Again, the core of the firmware was inherited from the original i4004-sbc board, but I had to make a few corrections.

The Intel 4040 memory interface is a bit more complex and has an extra pin to select the ROM bank. That is why we have to do more work in between cycles. Additionally, the i4004-sbc board was configured to run at the lowest specified clock boundary of the Intel 4004 CPU, 625kHz, while I need the highest possible frequency (740kHz) to fit into my tight target timing.

At some point I found that I could not use the clocking timer (the stm32 timer that produces clocks for the CPU) interrupt, because of latencies. I switched to simple pin polling inside the main loop:

```c
while ((OUT_4040_PHI1_GPIO_Port->IDR & OUT_4040_PHI1_Pin) == OUT_4040_PHI1_Pin);
OUT_TEST1_GPIO_Port->ODR ^= OUT_TEST1_Pin;
handleCyclePhi1Falling();
OUT_TEST1_GPIO_Port->ODR ^= OUT_TEST1_Pin;
```

The pin toggling is there to track how much time the stm32 spends inside a subcycle.

Another consideration was the USB interface instead of UART. Interrupts had to be disabled during the i4040 program run because of the very tight timings: each subcycle is 1350ns long, and the stm32 has to fit into that window. With a 168MHz core frequency, the stm32 has just 200 spare cycles to read pins, perform the necessary calculations, and set output pins.

So all interaction with the PC is frozen after the `START` command with the ROM dump is received. I/O instructions from the Intel 4040 are tracked and logged into an in-memory buffer. When the CPU reaches the halt state (the new `HLT` instruction), we send the accumulated output buffer to the controlling program on the PC.

After fixing all the issues with hardware and firmware, I was able to run i4040 programs. At first, I found that I had made a heart-breaking mistake in my calculations during tests on the emulator: I used the wrong coefficient to compute the estimated running time based on the cycle count, `95000 cycles/sec` instead of `92500 cycles/sec`. It meant that the real running time on the actual i4040 would be just a bit more than the target 70 hours, so I had to return to the source code and find even trickier optimizations.

Out of curiosity, I started with the smaller job: computing the first 255 digits of π.

```
[2023-10-31T12:08:05.527Z] START command has been received, RAM dump size is 8192 bytes
[2023-10-31T13:13:08.988Z] Program has been finished
[2023-10-31T13:13:08.994Z] Instruction cycles processed = 00000000158651E4
[2023-10-31T13:13:08.995Z] Output from i4040 (263 nibbles):
[2023-10-31T13:13:08.995Z] 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9 2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 1 1 7 0 6 7
[2023-10-31T13:13:08.995Z] 9 8 2 1 4 8 0 8 6 5 1 3 2 8 2 3 0 6 6 4 7 0 9 3 8 4 4 6 0 9 5 5 0 5 8 2 2 3 1 7 2 5 3 5 9 4 0 8 1 2 8 4 8 1 1 1 7 4 5 0 2 8 4 1 0 2 7 0 1 9 3 8 5 2 1 1 0 5 5 5 9 6 4 4 6 2 2 9 4 8 9 5 4 9 3 0 3 8 1 9
[2023-10-31T13:13:08.995Z] 6 4 4 2 8 8 1 0 9 7 5 6 6 5 9 3 3 4 4 6 1 2 8 4 7 5 6 4 8 2 3 3 7 8 6 7 8 3 1 6 5 2 7 1 2 0 1 9 0 9 1 4 5 6 4 8 5 6 6 9 2 3 F
```

As you can see, it took 01h 05m 03s, compared with 03h 31m 13s for the run of the spigot algorithm on the i4004. That's not a fair comparison of any kind, because the CPU frequencies were different and, more importantly, I spent no time on optimization in the previous project. But it was still kind of curious to see those numbers.

And finally, I started the computation of 2048+ digits with an ETA of a bit less than 70 hours. I prepared an Intel 4040 CPU with a couple of tiny radiators, and it successfully finished the job:

The final time is 69h 28m 31s, and my journey completes here.

PS: The attentive reader may notice that the time from the hardware run differs from the emulator, even though the number of executed cycles is close. That's because the clock signal generated by the stm32 timers is not 100% accurate: the output frequency is about 740.1kHz, which is higher than the specified 740kHz. So I'd say the Intel 4040 even got overclocked.