Bringing emulation into the twenty-first century
Emulation is a captivating area of software engineering: bringing a 30+ year old arcade machine to life on a modern computer is an incredibly satisfying achievement. Sadly, I've become increasingly disillusioned with the lack of ambition shown by the emulation community. While the rest of the world moves on to cloud-first, massively distributed architectures, emulation is still stuck firmly in the twentieth century, writing single-threaded C++ of all things.
This project was born out of a desire to bring the best of modern design back to the future of ancient computing history.

So what can the best of modern architecture bring to the emulation scene?
- Hot-swappable code paths allowing for in-game debugging
- Different languages for different components
- Secure by default (mTLS on all function calls)
- Scalability
- Fault tolerance
- Cloud native design
This culminated in the implementation of an 8080 microprocessor utilising a modern, containerised, microservices-based architecture running on Kubernetes, with frontends for a CP/M test harness and a full implementation of the original Space Invaders arcade machine.
The whole project can be found as a GitHub organisation at https://github.com/21st-century-emulation, which contains ~60 individual repositories, each implementing a single microservice or providing a piece of the infrastructure. This article goes into detail on the technical architecture and the issues I ran into along the way.
Key starting points to learn more are:
- A React-based 8080 disassembler running on GitHub Pages – https://github.com/21st-century-emulation/disassembler-8080
- The CP/M test harness used to validate the processor – https://github.com/21st-century-emulation/cpm-test-harness
  - Simply run `docker-compose up --build` in this repo to run the application
- Space Invaders UI – https://github.com/21st-century-emulation/space-invaders-ui
  - Run locally with `docker-compose up --build` or use the following project to deploy into Kubernetes
- Kubernetes configuration & deployment – https://github.com/21st-century-emulation/space-invaders-kubernetes-infrastructure
  - Note that this presupposes you have access to a Kubernetes cluster which can handle ~200 new pods
Finally, a screenshot of the emulator in action can be seen here:

Architectural Overview
The following image describes the full architectural model as applied to a Space Invaders arcade machine; the key components are then drawn out in the sections that follow.

Central Fetch Execute Loop
All old-school emulators fall into one of two camps: either they step the CPU one instruction at a time and then catch the other components up, or they step everything (including the CPU) one cycle at a time. The 8080 as deployed in a Space Invaders cabinet gains nothing from being emulated with cycle-level accuracy, so this emulator adheres to the former design.
The Fetch Execute Loop service is the service which performs that core loop, and it is broadly shaped as follows:
```
while true:
    call microservice to check whether an interrupt should occur
    if so, run the corresponding RST x instruction
    get the next instruction from the memory bus microservice
    call the corresponding opcode microservice
```
That's it. In order to actually drive this microservice we also provide /api/v1/start and /api/v1/state endpoints, which respectively trigger a new instance of the CPU to run and report the status of the currently running CPU.
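To make the cost model concrete, the core loop amounts to a chain of HTTP calls per instruction. The sketch below is a minimal Rust rendering under stated assumptions: the hostnames, the `service_for` routing helper and the response shapes are illustrative rather than the project's actual code, and the interrupt check is omitted (see the Interrupt service section below).

```rust
// A minimal sketch, not the project's code. Assumes Cargo dependencies:
// reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1".
use serde_json::{json, Value};

fn service_for(opcode: u8) -> &'static str {
    // Hypothetical routing table: one hostname per opcode microservice.
    match opcode {
        0x00 => "noop-service",
        _ => "mov-service",
    }
}

fn run_cpu(cpu_id: &str, mut state: Value) -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    loop {
        // 1. Fetch up to 3 bytes at the program counter from the memory bus
        //    (assuming readRange answers with a JSON array of bytes).
        let pc = state["programCounter"].as_u64().unwrap_or(0);
        let bytes: Vec<u8> = client
            .get(format!(
                "http://memory-bus/api/v1/readRange?id={cpu_id}&address={pc}&length=3"
            ))
            .send()?
            .json()?;

        // 2. Dispatch to the opcode microservice and adopt whatever CPU state
        //    it returns (assuming services echo back an updated `state` field).
        let body = json!({ "id": cpu_id, "opcode": bytes[0], "state": state });
        let response: Value = client
            .post(format!("http://{}/api/v1/execute", service_for(bytes[0])))
            .json(&body)
            .send()?
            .json()?;
        state = response["state"].clone();
    }
}
```

Every iteration therefore pays for at least one round trip to the memory bus and one to an opcode service, which is exactly the cost model the Performance section below grapples with.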
Opcode microservices
Each opcode corresponds to a microservice which must provide a POST API at /api/v1/execute taking a JSON body shaped as follows:
```
{
  "id": "uuid",
  "opcode": 123, // Current opcode, used to disambiguate calls to e.g. MOV (MOV B,C vs MOV B,D)
  "state": {
    "a": 0,
    "b": 0,
    "c": 0,
    "d": 0,
    "e": 0,
    "h": 0,
    "l": 0,
    "flags": {
      "sign": false,
      "zero": false,
      "auxCarry": false,
      "parity": false,
      "carry": false
    },
    "programCounter": 100,
    "stackPointer": 1000,
    "cyclesTaken": 2000,
    "interruptsEnabled": false
  }
}
```
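For implementers, that body maps directly onto a pair of small types. Here is a minimal Rust model, under the assumption that `serde` is used for (de)serialisation; the field names are taken from the JSON above:

```rust
// Sketch: assumes serde = { version = "1", features = ["derive"] } in Cargo.toml.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
#[serde(rename_all = "camelCase")]
pub struct Flags {
    pub sign: bool,
    pub zero: bool,
    pub aux_carry: bool, // serialised as "auxCarry"
    pub parity: bool,
    pub carry: bool,
}

#[derive(Serialize, Deserialize, Debug)]
#[serde(rename_all = "camelCase")]
pub struct CpuState {
    pub a: u8, pub b: u8, pub c: u8, pub d: u8, pub e: u8, pub h: u8, pub l: u8,
    pub flags: Flags,
    pub program_counter: u16,
    pub stack_pointer: u16,
    pub cycles_taken: u64,
    pub interrupts_enabled: bool,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct ExecuteRequest {
    pub id: String,
    pub opcode: u8,
    pub state: CpuState,
}
```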
Memory bus
The memory bus provides the stateful storage for the system and must expose four routes to the other services:
- /api/v1/readByte?id=${cpuId}&address=${u16} – Read a single byte from the address passed in
- /api/v1/writeByte?id=${cpuId}&address=${u16}&value=${u8} – Write a single byte to the address passed in
- /api/v1/readRange?id=${cpuId}&address=${u16}&length=${u16} – Read at most `length` bytes starting at the address (to get e.g. the three bytes that correspond to an instruction)
- /api/v1/initialise?id=${cpuId} – POST; takes a base64 encoded string as the body and uses it to initialise the memory bus for the cpu id passed in
There is a simple & fast implementation, written in Rust with a high-tech in-memory database, available at https://github.com/21st-century-emulation/memory-bus-8080. Alternative implementations utilising persistent storage are left as an exercise for the reader. A blockchain-based backend is probably the best solution to this problem.
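To make the contract concrete, here is a hedged client sketch for those routes; the base URL handling, the HTTP verbs and the response formats are assumptions for illustration rather than details verified against the repository:

```rust
// A hedged client sketch for the memory bus routes, reusing the reqwest
// blocking client from the earlier example.
pub struct MemoryBusClient {
    base: String,
    http: reqwest::blocking::Client,
}

impl MemoryBusClient {
    pub fn new(base: impl Into<String>) -> Self {
        Self { base: base.into(), http: reqwest::blocking::Client::new() }
    }

    pub fn read_byte(&self, cpu_id: &str, address: u16) -> Result<u8, reqwest::Error> {
        // Assumes the bus returns the byte as a bare JSON number.
        self.http
            .get(format!("{}/api/v1/readByte?id={cpu_id}&address={address}", self.base))
            .send()?
            .json()
    }

    pub fn write_byte(&self, cpu_id: &str, address: u16, value: u8) -> Result<(), reqwest::Error> {
        // The write verb is assumed to be POST for this sketch.
        self.http
            .post(format!(
                "{}/api/v1/writeByte?id={cpu_id}&address={address}&value={value}",
                self.base
            ))
            .send()?;
        Ok(())
    }
}
```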
Interrupt service
When running the fetch execute loop service you can optionally provide (via an environment variable) the URL of an interrupt check service, which will be called before every opcode is executed. This API must take the same JSON body as the opcode microservices and return an optional value indicating which RST opcode should be taken (or none, if no interrupt is to be fired).
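In Rust terms that contract reads naturally as a call returning an `Option<u8>`. The sketch below assumes the service answers with a bare RST vector or JSON null, which is my reading of the description above rather than a documented wire format; the environment variable name is also invented:

```rust
// Sketch of the optional interrupt check, reusing the types from earlier sketches.
fn check_interrupt(
    client: &reqwest::blocking::Client,
    url: &str,                 // e.g. read from an INTERRUPT_CHECK_URL env var (name assumed)
    body: &serde_json::Value,  // same JSON body as the opcode services
) -> Result<Option<u8>, reqwest::Error> {
    // `5` => take RST 5; `null` => no interrupt pending.
    client.post(url).json(body).send()?.json()
}
```

If this yields `Some(n)`, the loop dispatches RST n before fetching the next instruction.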
Deployment architecture
Whilst the application can be run locally using docker-compose, no self-respecting cloud solutions architect would be satisfied with the risks inherent in having everything pinned to a single machine. Consequently this project also delivers a helm chart, which can be found here.
Given that repository and a suitably large Kubernetes cluster (note: we strongly recommend choosing a top-tier cloud provider like IBM for this), all components can be installed by simply running ./install.sh.
The Kubernetes architecture is outlined in https://github.com/21st-century-emulation/space-invaders-kubernetes-infrastructure/blob/main/README.md but a diagram is provided here for brevity:

Performance
As with any modern design it's important to adhere to the model of "make it work, then make it fast", and that's something this project really takes to heart. In 1974, when the 8080 launched, it achieved a staggering 2MHz. Our new modern, containerised, cloud-first design doesn't quite achieve that in its initial iteration. As can be seen from the screenshot above, Space Invaders as deployed onto an AKS cluster runs at ~1KHz, which gives us ample time for debugging but does make actually playing it slightly difficult.
However, now that the application works we can look at optimising it. The following are clear future directions:
- Rewrite more things in Rust. As we can see in the image below, a significant portion of the total CPU time was spent running the LXI & POP opcodes. That is quite understandable, because LXI is written in Java/Spring and POP is written in Scala/Play; both are clearly orders of magnitude slower than all the other languages in play here.
- JSON -> Avro/Protobuf. JSON serialisation/deserialisation is known to be too slow for modern applications; using a better binary-packed format will clearly improve performance.
- Pipelining & speculative execution.
  - A minor speed boost could be achieved by simply pipelining up to the next N instructions and invalidating the pipeline on any instruction which changes the program counter. This is particularly wonderful because it brings modern CPU design back to the 8080!
  - Since all operations internally are async and wait on IO, we can trivially execute multiple instructions in parallel; a further enhancement would therefore be to speculatively execute instructions and roll back if the execution of a previous one would have affected the result.
- Memory caches
  - Having to access the memory bus every time is slow. By noting which instructions can affect memory, we can act like a modern VM and cache memory until a write happens, at which point we invalidate the cache and continue (a sketch of this idea follows the list). See the image below showing the number of requests made to /api/v1/readRange from the fetch execute loop (which uses that API to get the next instruction).
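Here is a hedged sketch of that read-through cache in Rust. The structure and names are illustrative, and the invalidation policy is deliberately brutal (any write anywhere flushes everything); real 8080 programs executing from RAM would need finer-grained tracking:

```rust
use std::collections::HashMap;

/// Read-through instruction cache: serve fetches locally until any write is
/// reported, then drop the cache and go back to the memory bus.
pub struct FetchCache {
    lines: HashMap<u16, [u8; 3]>, // program counter -> the 3 bytes at that address
}

impl FetchCache {
    pub fn new() -> Self {
        Self { lines: HashMap::new() }
    }

    /// Serve an instruction fetch locally, falling back to the memory bus
    /// (represented here by any closure) on a miss.
    pub fn fetch(&mut self, address: u16, read_range: impl Fn(u16) -> [u8; 3]) -> [u8; 3] {
        *self.lines.entry(address).or_insert_with(|| read_range(address))
    }

    /// Call whenever any opcode service reports a memory write.
    pub fn invalidate(&mut self) {
        self.lines.clear();
    }
}
```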
Implementation Details
One of the many beautiful things about a microservice architecture is that, because function calls are now HTTP over TCP, we are no longer limited to a single language in the environment. That allows us to really leverage the best that modern HTTP API design has to offer.
The following table outlines the language choice for each opcode. As you can see, this allows us to gain the benefits of Rust's safe integer arithmetic operations whilst falling back on the security of Deno for critical operations like CALL & RET.
| Opcode | Language | Description | Runtime image size | Performance (avg latency) |
|---|---|---|---|---|
| MOV | Swift | Moves data from one register to another | 257MB | 4.68ms |
| MVI | Javascript | Puts 8 bits into register, or memory | 118MB | 3.43ms |
| LDA | VB | Puts 8 bits at location Addr into A register | 206MB | 4.56ms |
| STA | C# | Stores 8 bits at location Addr | 206MB | 4.61ms |
| LDAX | Typescript | Loads A register with 8 bits from location in BC or DE | 365MB | 6.22ms |
| STAX | Python | Stores A register at location in BC or DE | 59MB | 5.24ms |
| LHLD | Ruby | Loads HL register with 16 bits found at Addr and Addr+1 | 898MB! | 13.63ms |
| SHLD | Perl | Stores HL register contents at Addr and Addr+1 | 930MB! | 12.68ms |
| LXI | Java + Spring | Loads 16 bits into B, D, H, or SP | 415MB | 6.84ms |
| PUSH | Lua | Puts 16 bits of RP onto stack; SP=SP-2 | 385MB | 4.42ms |
| POP | Scala + Play | Takes top of stack, puts it in RP; SP=SP+2 | 761MB | 13.99ms |
| XTHL | D | Exchanges HL with top of stack | 156MB | 26.54ms |
| SPHL | F# | Puts contents of HL into SP (stack pointer) | 114MB | 3.25ms |
| PCHL | Kotlin | Puts contents of HL into PC (program counter) [=JMP (HL)] | 445MB | 7.61ms |
| XCHG | C++ | Exchanges HL and DE | 514MB | 2.16ms |
| ADD | Rust | Add accumulator and register/(HL) | 123MB | 1.95ms |
| ADC | Rust | Add accumulator and register/(HL) (with carry) | 123MB | 2.00ms |
| ADI | Rust | Add accumulator and immediate | 123MB | 2.16ms |
| ACI | Rust | Add accumulator and immediate (with carry) | 123MB | 2.22ms |
| SUB | Rust | Sub accumulator and register/(HL) | 123MB | 1.95ms |
| SBB | Rust | Sub accumulator and register/(HL) (with borrow) | 123MB | 1.70ms |
| SUI | Rust | Sub accumulator and immediate | 123MB | 2.15ms |
| SBI | Rust | Sub accumulator and immediate (with borrow) | 123MB | 1.91ms |
| ANA | Rust | And accumulator and register/(HL) | 123MB | 2.68ms |
| ANI | Rust | And accumulator and immediate | 123MB | 1.93ms |
| XRA | Rust | Xor accumulator and register/(HL) | 123MB | 1.70ms |
| XRI | Rust | Xor accumulator and immediate | 123MB | 1.57ms |
| ORA | Nim | Or accumulator and register/(HL) | 74MB | 11.36ms |
| ORI | Rust | Or accumulator and immediate | 123MB | 1.40ms |
| DAA | Rust | Decimal adjust accumulator | 123MB | 2.26ms |
| CMP | Rust | Compare accumulator and register/(HL) | 123MB | 1.70ms |
| CPI | Rust | Compare accumulator and immediate | 123MB | 1.90ms |
| DAD | PHP | Adds contents of register RP to contents of HL register | 430MB | 17.2ms |
| INR | Crystal | Increments register | 23MB | 1.98ms |
| DCR | Crystal | Decrements register | 23MB | 2.06ms |
| INX | Crystal | Increments register pair | 23MB | 2.01ms |
| DCX | Crystal | Decrements register pair | 23MB | 1.99ms |
| JMP | Powershell | Unconditional jump to location Addr | 294MB | 6.51ms |
| CALL | Deno | Unconditional subroutine call to location Addr | 154MB | 6.04ms |
| RET | Deno | Unconditional return from subroutine | 154MB | 6.43ms |
| RLC | Go | Rotate left carry | 6MB | 2.28ms |
| RRC | Go | Rotate right carry | 6MB | 2.19ms |
| RAL | Go | Rotate left accumulator | 6MB | 2.39ms |
| RAR | Go | Rotate right accumulator | 6MB | 2.29ms |
| IN | | Data from Port placed in A register | | |
| OUT | | Data from A register placed in Port | | |
| CMC | Haskell | Complement carry flag | 90MB | 2.50ms |
| CMA | Haskell | Complement A register | 90MB | 2.54ms |
| STC | Haskell | Set carry flag = 1 | 90MB | 2.52ms |
| HLT | | Halt CPU and wait for interrupt | | |
| NOOP | C | No operation | 70MB | 1.89ms |
| DI | Dart | Disable interrupts | 79MB | 2.37ms |
| EI | Dart | Enable interrupts | 79MB | 2.21ms |
| RST | Deno | Call interrupt vector | 154MB | 7.34ms |
Nim was a bit late to the party so it only got one opcode, and it still managed to be slow anyway.
Code details
According to SCC this project cost $1M to make, which is probably several orders of magnitude less than Google would pay for it. Like any true modern application it also consists of significantly more JSON/YAML than code.
───────────────────────────────────────────────────────────────────────────────
Language Files Lines Blanks Comments Code Complexity
───────────────────────────────────────────────────────────────────────────────
YAML 144 6873 859 396 5618 0
Dockerfile 112 2007 505 353 1149 248
JSON 106 15383 240 0 15143 0
Shell 64 2448 369 195 1884 257
Plain Text 59 404 171 0 233 0
Docker ignore 56 295 0 0 295 0
Markdown 56 545 165 0 380 0
gitignore 56 847 129 178 540 0
C# 35 1803 198 10 1595 51
TypeScript 22 1335 116 30 1189 275
Rust 18 1825 241 7 1577 79
TOML 18 245 20 0 225 0
Java 7 306 66 0 240 1
Haskell 6 207 24 0 183 0
Visual Basic 6 119 24 0 95 0
MSBuild 5 53 13 0 40 0
Crystal 4 330 68 4 258 5
Go 4 255 33 0 222 6
JavaScript 4 147 8 1 138 5
License 4 84 16 0 68 0
PHP 4 147 21 43 83 2
Swift 4 283 15 4 264 1
C++ 3 32 4 0 28 0
Emacs Lisp 3 12 0 0 12 0
Scala 3 112 15 0 97 6
XML 3 97 13 1 83 0
C Header 2 24 2 0 22 0
CSS 2 44 6 0 38 0
Dart 2 58 6 0 52 18
HTML 2 361 1 0 360 0
Lua 2 65 8 0 57 15
Properties File 2 2 1 0 1 0
C 1 63 8 0 55 18
CMake 1 68 10 15 43 4
D 1 71 6 2 63 14
F# 1 88 11 3 74 0
Gemfile 1 9 4 0 5 0
Gradle 1 32 4 0 28 0
Kotlin 1 40 9 0 31 0
Makefile 1 16 4 0 12 0
Nim 1 82 9 0 73 2
Perl 1 49 6 3 40 4
Powershell 1 78 9 0 69 10
Python 1 37 6 0 31 1
Ruby 1 28 6 0 22 0
SVG 1 3 0 0 3 0
TypeScript Typings 1 1 0 1 0 0
───────────────────────────────────────────────────────────────────────────────
Total 833 37413 3449 1246 32718 1022
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $1,052,256
Estimated Schedule Effort (organic) 14.022330 months
Estimated People Required (organic) 6.666796
───────────────────────────────────────────────────────────────────────────────
Processed 1473797 bytes, 1.474 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
Issues
Naturally I ran into a number of new issues with this approach. I've listed some of them below to give a flavour of the kinds of problems this architecture brings:
- GitHub Actions…
  - Random timeouts logging in to ghcr.io, random timeouts pushing images, and generally just spontaneous errors of all kinds. This really drove home how fun it is managing the development toolchain for a microservice architecture.
- Haskell compiler images
  - Oh boy. Haskell won the award for my least favourite development environment solely off the back of the absurd 3.5GB SDK image! That was sufficiently large that it was impossible to build the Haskell-based services in CI without fine-tuning the image down to < 3.4GB (the GitHub Actions limit).
- Intermittent AKS networking errors
  - Whilst the system achieved ~4 9s availability across all microservices, there were spontaneous 504s between microservices in the AKS deployment.
  - On the plus side, because we're using Linkerd as a service mesh to give us secure microservice TCP connections, we can also just leverage its retry behaviour and forget about the problem! Exactly like a modern architecture!
- DNS caching (or not)
  - Only node.js, of all the languages used, had issues where it would hammer the DNS server on literally every HTTP request; eventually DNS told it to piss off and the next request broke. #justnodethings
- Logging at scale
  - I initially set up Loki as the logging backend because it's new and therefore good, but found that the C# libraries for Loki would occasionally send requests out of order and that, in the end, Loki would just give up and stop accepting logs. Fortunately fluentd is still very much in the spirit of this project and really pins the project down to Kubernetes, so it was clearly the best decision all along.
- Orchestrating changes across services
  - Unusually, having ~50 repositories to manage was marginally harder than having 1. Making a change to (for example) add an interruptsEnabled flag to the CPU needed to be orchestrated across all the microservices. Fortunately I'm quite good at writing disgusting bash scripts, like any self-respecting devops engineer.
Is this actually possible?
Alright, if you've got this far I'm sure you've realised that the whole project is something of a joke. That said, it is also an interesting intellectual exercise to consider whether it's remotely possible to achieve >=2MHz with the architecture delivered.
The starting point is that to achieve 2MHz we must deliver 1 instruction every 2μs:
2MHz = 2,000,000 cycles per second
Each instruction takes 4-17 cycles, so at worst we need to manage 2,000,000 / 4 = 500,000 instructions per second. That gives 1/500,000 seconds = ~2μs per instruction.
As written there are 3 HTTP calls per instruction: one to fetch the operation, one to execute it and one to check for interrupts.
Assuming for the sake of argument that we make the following optimisations:
- Take interrupt checks off the hot path
- Cache all ROM in the fetch execute service and assume applications only execute from ROM (true for Space Invaders)
  - This takes us to ~1 HTTP call per instruction
- Switch from JSON with UTF-8 encoding to sending a byte-packed array of values to represent the CPU (a sketch of this follows the list)
  - Drives the request size down to <256 bytes and eliminates all serialisation/deserialisation costs (just have a struct pointer pointing at the array)
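On that last point, here is a hedged Rust sketch of what "a struct pointer pointing at the array" could look like; the field order and flag bit layout are mine, purely for illustration, and endianness is ignored for simplicity:

```rust
// A hedged sketch of a byte-packed CPU representation for the wire.
#[repr(C, packed)]
#[derive(Clone, Copy)]
pub struct PackedCpu {
    pub a: u8, pub b: u8, pub c: u8, pub d: u8, pub e: u8, pub h: u8, pub l: u8,
    pub flags: u8,            // bit-packed: sign, zero, auxCarry, parity, carry
    pub program_counter: u16,
    pub stack_pointer: u16,
    pub cycles_taken: u64,
    pub interrupts_enabled: u8,
}

impl PackedCpu {
    /// View the whole state as raw bytes: 21 bytes, comfortably under the
    /// 256-byte budget, and no serialisation step at all.
    pub fn as_bytes(&self) -> &[u8] {
        // Sound because the struct is #[repr(C, packed)], so there is no padding.
        unsafe {
            std::slice::from_raw_parts(
                self as *const PackedCpu as *const u8,
                std::mem::size_of::<PackedCpu>(),
            )
        }
    }
}
```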
Then we get to a good starting point of exactly 1 round trip to/from each opcode. So what's the minimal cost for a round trip across the network?
This answer (https://quant.stackexchange.com/questions/17620/what-is-the-current-lowest-possible-latency-for-tcp-communication) from 2015 benchmarks loopback device latency at ~2μs if the request size can be kept down to <=256 bytes.
Assuming that person knows what they're talking about, the quick answer is a straight no. You'll never achieve the required latency across a real network (particularly a dodgy cloud data centre network).
But let's not give up quite yet. We're not miles away from the performance required, so we can look for 2x speedups.
Some thoughts on techniques to get that last 2x speedup:
- Fire-and-forget memory writes
  - A memory write is almost never read immediately, so just chuck it on the bus and don't bother blocking until it's written. Maybe you'll lose some writes? That's fine. Very mongo. fsync is for boring C coders, and modern developers aren't supposed to need to know about nasty complicated things like the CAP theorem anyway. Presumably Kubernetes will solve that for us.
- We can execute multiple operations in parallel and only validate the correctness of their results later.
  - This would obviously speed up operations like memsets, which are done with simple MVI (HL) d8 -> DCX HL -> JNZ style algorithms where each grouping can be executed in parallel.
- If each opcode service were capable of determining the next instruction, we could avoid the second half of each round trip and not travel back to the fetch execute loop until the stream of instructions has run out.
  - This is basically a guaranteed 2x speedup.
Conclusion? I think it might be possible under some ideal conditions, assuming ~no network latency, but I have no intention of spending any more time thinking about it!