Operating Rust on Logic Gates
????
This text will talk about many matters, from CPU structure design to historic shenanigans. Take a drink, it is downhill from there.
Although the quantity has steadily decreased for the reason that 90s, there are nonetheless many alternative and incompatible CPU architectures in use these days. Most computer systems use x86_64 and just about all cellular gadgets and up to date Macs use some sort of ARM64-based ISA (instruction set structure).
In particular fields, although, there are extra unique ones: most routers nonetheless use MIPS (for historic causes), a roomful of builders use RISC-V, the PS3 used PowerPC, some servers 20 years in the past used Itanium, and naturally IBM nonetheless sells their S/390-based mainframes (now rebranded as z/Structure). The embedded world has much more: AVR (utilized in Arduino), SuperH (Saturn, Dreamcast, Casio 9860 calculators), and the venerable 8051, an Intel chip from 1980 which remains to be being produced, offered and even prolonged by third events.
All these architectures differ on their defining traits, the primary ones being:
- phrase dimension: 8, 16, 31, 32, 64 bits, generally extra
- design type: RISC (few directions, easy operations), CISC (many directions, performing advanced operations, VLIW (lengthy directions, doing many issues without delay in parallel)
- reminiscence structure: Harvard (separate code reminiscence and information reminiscence), von Neumann (shared)
- licensing prices: RISC-V is open and free to make use of, whereas x86 and ARM, for instance, require licensing charges
- broadly, their function set: floating-point numbers (x87), encryption (AES-NI), help for native high-level bytecode execution (Jazelle, AVR32B), vectorized computation (SSE, AVX, AltiVec)
????
That is not even counting DSP architectures, that are, to place it evenly, the ISA counterpart to the Twilight Zone (supporting bizarre arithmetic operations, peculiar information sizes, and many others).
Lots of people have constructed home made CPUs, both on real breadboards or in software program, for emulators or circuit synthesis. It is a very attention-grabbing challenge to do, even for newcomers (actually, try Ben Eater’s video sequence), as a result of it actually helps grasp how code interprets to {the electrical} alerts that energy each system we use, and the way advanced language options can actually be applied on high of easy operations.
Quite a few circumstances have led me to design a easy ARM-ish CPU in a digital circuit simulator. I initially used logisim-evolution (of which I’ve since change into a member of the event crew), and not too long ago migrated the circuit to Digital, for efficiency causes (Logisim could not simulate my circuit at greater than 50 or 60 Hz, whereas Digital reaches 20 kHz).
ARM, as a result of it helps a subset of the ARM Thumb instruction set, which itself is without doubt one of the a number of instruction units supported by ARM CPUs. It makes use of 32-bit phrases, however the instruction are 16 bits extensive.
-ish, as a result of, properly, it solely helps a subset of it (massive, however nowhere close to full), and is intentionally restricted in some points. Weird directions, such because the PUSH
/ POP
/ LDM
/ STM
household (one of many massive CISC ink blots within the RISC ARM ISA), are usually not supported and are applied as handbook load/shops by the assembler. Interrupts are usually not supported both.
Pedantically, I did not design solely a CPU however what one might name a pc; it has a ROM, a RAM, and varied gadgets that function the “entrance panel”.
Fast sidenote: gadgets
To be actually helpful, a pc won’t solely have a CPU and a reminiscence chip. It will even have peripherals and different objects plugged into it: a keyboard, a display, a disk drive, audio system, a community card; just about anything you can (or can’t) think about has already been made into a pc system.
On the finish of the day, the one factor you want is to have the ability to transmit information to and from the system. There are two reverse methods to do that: both gadgets are ✨particular✨, both they don’t seem to be.
Mainly, some architectures (x86, I am you) have, along with the reminiscence, a particular, separate tackle house for I/O, with its personal particular, completely different directions: on an 8086, you’d use MOV
to learn and write predominant reminiscence, and IN
/ OUT
to learn and write to a tool. Some gadgets (crucial ones: PS/2 controller, floppy disk, serial port, …) have a set port quantity, different have a port quantity assigned at boot by the BIOS. Within the olden days, it was widespread apply to require setting surroundings variables or writing config information to tell software program of which gadgets had been plugged had been (e.g. the well-known BLASTER config line). That is known as PMIO (port-mapped enter/output).
The opposite choice, the one utilized by just about everyone else, together with trendy x86 computer systems, is to have a single unified tackle house, however to make it digital.
Think about how IP addresses are presupposed to map your complete Web, however in actuality an tackle doesn’t have to precisely map to a single machine someplace. For instance, 127.0.0.1
(::1
in IPv6) is the native loopback, and maps to the machine you are utilizing. This isn’t required to be identified by software program speaking over the community, for the reason that mapping is completed by the community stack of the OS.
It is the identical right here: areas of the (digital) tackle house are mapped to bodily elements. To present you a real-world instance, here is the NES’s tackle house:
How one can learn: addresses from 0 to 800 (hexadecimal) are mapped to WRAM (work RAM), from 2000 to 2008 to the PPU (graphics card) management registers, from 4000 to 4018 to the APU (sound card), I belief you’ll be able to take it from there. That is known as MMIO (memory-mapped enter/output).
The massive benefit of this method, for me, is de facto its simplicity, CPU-wise: it is simply reminiscence! Take the tackle of the system’s space, learn, write, it is actually easy. It additionally makes issues simpler software-wise: you do not have to put in writing inline meeting to name particular directions; so long as you’ll be able to learn and write from pointers, you are good to go.
Along with this, reminiscence mapping may also be used to supply entry to completely different reminiscence chips (e.g. ROM and RAM). This is what it seems like for my circuit:
Discover the arrows on all sides of the perimeters between the elements and the mapper; they point out whether or not the part is read-only, learn/write or write-only.
The CPU
It is easy, actually. I imply, in comparison with actual CPUs.
As in actual Thumb, there are sixteen 32-bit registers, numbered r0
by means of r15
. The final three have nicknames: r13
is sp
(stack pointer), r14
is lr
(hyperlink register) and r15
is computer
(program counter).
Fast sidenote: stack pointer
16 is quite a bit; so in actuality they’re divided into the low (r0
–r7
) and excessive (r8
–r15
) registers. Excessive registers can solely be manipulated utilizing particular directions, so the low ones are those you are gonna use in on a regular basis life.
Directions are grouped into a number of classes, every containing directions with a standard header. I will not checklist all of them right here, however probably the most used ones are the ALU (arithmetic and logic) operations, the load/retailer directions (relative to computer
, to sp
, or to a basic register), the stack manipulation directions and the department directions (conditional and unconditional).
Instruction teams are dealt with by separate, unbiased subcircuits, that every one write into shared buses.
Fast sidenote: buses
In circuit design parlance, a bus is a gaggle of wires linked collectively, however the place just one wire emits a sign at a given on the spot. Electrically, that is permitted by means of what’s known as three-state logic: a sign is both 0, 1 or Z (pronounced “high-impedance”). Z is “weak”, within the sense that in case you join a Z sign with a 0 or 1 sign, the output can be that of the second sign. That is helpful: you’ll be able to have numerous unbiased elements, every with an “allow” enter, that solely output a sign if they’re enabled, and in any other case output Z. All these elements can then be plugged collectively, and you’ll allow one and get its output simply.
Element instance: load/retailer with register offset
That is the part that handles directions of the shape {route}R{signal}{mode} {vacation spot}, [{base}, {offset}
, with:
{direction}
: eitherLD
(load) orST
(store){sign}
: either nothing (do not extend), orS
(sign-extend the value to fill 32 bits){mode}
: either nothing (full word, 32 bits),H
(halfword, 16 bits) orB
(byte, 8 bits){destination}
: the target register, to read from/write to{base}
,{offset}
: the address in memory (which will be the sum of the values of both)
As an example, ldrh r1, [r2, r3]
is roughly equal to r1 = *(quick*)(r2 + r3)
in C code.
The directions for this group are encoded as follows:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
0 | 1 | 0 | 1 | opcode | Ro | Rb | Rd |
opcode
is a 3-bit worth encoding each {route}
, {signal}
and {mode}
.
????
{route}
has two doable values (load, retailer); {signal}
has two (uncooked, sign-extended) and {mode}
has three (phrase, halfword, byte). That is 2⋅2⋅3 = 12 doable combos, greater than the two3 = 8 doable values for opcode
, which signifies that some combos are usually not doable.
It is because solely load operations for incomplete (halfword, byte) values may be sign-extended, so the invalid combos (strsh
, strsb
, strs
, ldrs
) are usually not given an encoding.
This is the circuit:
Listed below are the completely different unique logic elements used right here:
- the small triangles are tunnels: named wires which might be accessible wherever within the circuit;
- the massive trapezoids are multiplexers: they output the nth enter, the place n can also be an enter;
- the three-wired triangles are buffers: they output their enter if the aspect wire is excessive, in any other case they output Z (high-impedance);
- the yellow bins convert a 3-bit low register quantity right into a 4-bit register quantity (by including a zero in entrance of it);
- the big rectangles with integer ranges subsequent to them are splitters: they break up a multi-bit worth into a number of smaller values, to entry particular person bits or bit ranges
From high to backside:
- the three registers (Rd, Rb, Ro) are learn from the instruction at their respective positions (0-2, 3-5, 6-8) and despatched to the corresponding international tunnels (RW, RA, RB)
opcode
is decoded to verify whether or not it is a retailer (000
,001
,010
) or a load (remaining values)opcode
is decoded to search out the worth ofmode
: 0 for phrase, 1 for halfword, 2 for byteopcode
is but once more decoded to search out the worth ofsignal
(true just for opcodes011
and111
so we are able to verify that the final two bits are excessive)
The Instr0708 tunnel is the activation pin for this part; it is excessive if the present instruction belongs to this instruction group.
Just about all different elements appear like this one, and while you plug all of them collectively, you get a circuit that may execute directions.
Fast sidenote: reminiscence is tough
The seemingly easy drawback of manipulating information and storing it someplace with the intention to get it again later is definitely… not easy. Giving your CPU entry to a giant linear array of reminiscence cells shouldn’t be sufficient, you need to resolve what you’ll do with it. Have a look at this Python program:
print("Hi there, World!")
The place ought to the string be saved? It is gotta be someplace. What about print
? It is not an instruction, it is only a international variable that occurs to be set to an object of sort builtin_function_or_method
, you could name with the ()
operator. It is gotta be saved someplace too. Keep in mind, the one factor you actually have in the mean time is a giant array of numbers. On high of that, the one set of operations you actually have is {"load worth from tackle", "retailer worth at tackle"}
. Or is it?
The CPU speaks in meeting directions. These directions have a set, outlined encoding, and on the ARM Thumb instruction set they at all times (i.e. nearly at all times) have the identical dimension: 16 bits. Ignoring the header of the instruction (that tells you which of them one it’s), that may take up just a few bits, we shortly see that if we had been to provide tackle as immediates (fixed values, within the instruction), we could not tackle greater than 216 bytes of reminiscence.
Therefore: addressing modes and reminiscence alignment.
If we have a look at on a regular basis life applications, we are able to observe that there are two predominant use circumstances for reminiscence: storing native variables (variables in capabilities, or parameters), and storing international variables (international configuration, reminiscence that can be shared between applications).
Use case | Allocation dimension | Most lifetime | Time of allocation | Time of free |
---|---|---|---|---|
Native | Usually small | Present operate name | When getting into operate | When leaving operate |
International | Any | Static (lifetime of this system) | Any | Any |
There is a clear distinction: on one hand, “native reminiscence”, which is used for small, deterministic allocations, and “international reminiscence”, which is used for something, at any time, with only a few constraints.
How does that map to our “massive block of cells”? We’ll begin with the “international” reminiscence. We do not actually know something about the way it’ll be used, so we will not make too many assumptions. You’ll be able to ask for any quantity of bytes at any second, and provides it again to the OS any time you need, and also you often anticipate “given-back” house to be useable by subsequent allocations. That is hard. The actual-world equal could be a giant pile of stuff, laying on the bottom, ready for some program to select it up, use it, or throw it within the trash. For that reason, it is known as the heap.
Subsequent, we are able to see that the “native reminiscence” evolves in a particular manner: it grows once we enter a operate, shrinks once we exit it, and performance calls observe a stack-like sample (while you’ve entered a operate, you are able to do something you need, however you at all times find yourself exiting it sooner or later). In reality, it truly is a stack (within the algorithmic information construction that means); it is acquired two operations: push (develop) and pop (shrink). This “native reminiscence” is known as the stack.
Because it grows and shrinks that manner, we do not actually need to do any bookkeeping of the blocks of allotted reminiscence, the place they’re, what technique to make use of to decide on the place to allocate new blocks, and many others. The one piece of information we’d like is the “depth” (i.e. how deep we’re in that stack, or in different phrases, the size of the stack). The way in which that is often executed is that we set some place in reminiscence to be the start of the stack, and we maintain a worldwide variable someplace (for instance, in a register) that incorporates the place in reminiscence of the topmost merchandise of the stack: the stack pointer (on ARM, sp
, or its full identify r13
).
There’s one thing I have not specified but: which route the stack grows. Some architectures make it develop upward (push = increment stack pointer, pop = decrement), however most do it the other manner and make it develop downward. Rising it downward means you could simply make the heap begin at tackle 0 and make the stack begin at regardless of the most tackle is, and also you’re assured that they will not collide till the heap grows too far up or the stack grows too far down.
????
In reminiscence parlance, upward means rising and downward means lowering. “The stack grows downward” signifies that rising the stack decreases the stack pointer, and vice versa. Many diagrams on-line illustrate reminiscence with the tackle 0 on the high, suggesting that downward means rising, however that is deceptive.
Now that we all know how reminiscence works, how will we entry it? We have seen that we will not tackle all of it, since directions are too small, so how can we counter that?
Properly, the reply is to make use of completely different addressing modes. For instance, if you wish to entry reminiscence on the stack, you will often be accessing issues which might be on high of the stack (e.g. your native variables), so as a substitute of giving the complete reminiscence tackle (massive), you solely have to provide the gap to the info relative to the stack pointer (small). That is the sp
-relative addressing mode, and appears like ldr r1, [sp, #8]
.
Moreover, we are able to assume you’ll largely retailer issues which might be 4 bytes or greater, so we’ll say that the stack is word-aligned: all the things can be moved round in order that addresses are multiples of 4. Which means that we are able to match even greater numbers within the instruction: we solely need to retailer the offset, divided by 4. The instruction above would retailer 2, for instance.
Generally, you will wish to retailer information that’s solely helpful to a single operate. For instance, change
/ match
directions are often applied utilizing leap tables: a listing of offsets is saved in this system, and the proper offset is loaded and jumped to. Since this can be saved within the code of the operate itself, it turns into helpful to carry out reminiscence operations relative to the present place within the code, and that is how get get computer
-relative addressing: ldr r2, [pc, #16]
. As with sp
, reminiscence is word-aligned, so the offset needs to be a a number of of 4.
Fast sidenode: operate calls
The only strategy to name capabilities, in meeting, is thru the usage of jumps. You place a label someplace, you leap to it. There’s an issue although: how do you return? A operate may be known as from a number of locations, so that you want to have the ability to “bear in mind” the place the operate was known as from, and also you want to have the ability to leap to an tackle, slightly than to a identified label.
The straightforward method, like with the stack pointer earlier than, is to make use of a worldwide variable (i.e. register) to retailer the tackle of the caller, and to have a particular leap instruction that units the register to the present place (linking), so we are able to return to it later (branching). On ARM, that is the bl
(branch-link) household of directions, and that register is known as the hyperlink register (abbreviated lr
, nickname of r14
).
However there’s one other drawback: it would not work for nested calls! Should you name a operate from inside one other known as operate, the worth of the hyperlink register will get overwritten.
That is truly not a brand new drawback: different registers can get overwritten as properly while you name a operate, and you’ll’t anticipate the programmer to learn the code of each operate they’re calling to see which registers are secure and which are not. Right here comes the calling conference: on each structure (x86, ARM, …) there is a algorithm (the ABI) that inform you how all the things works, what a operate is allowed to do, and particularly what registers ought to be preserved by the callee. A preserved register shouldn’t be read-only: the callee can do no matter it desires with it, so long as when the management is given again to the caller, the previous worth is again.
The way in which to unravel that is by means of register saving. When getting into a operate, house is allotted on the stack for native variables but in addition for registers that need to be preserved, and when exiting, the unique values are pulled again from the stack into the registers.
Amongst these registers, on ARM, the hyperlink register is saved too. One cool facet of ARM particular registers being useable as general-purpose registers, is that you do not have to make use of a department instruction to leap someplace: you’ll be able to simply write into computer
!
The same old sample for capabilities in ARM meeting is thus:
my_function:
push {r4, r5, lr} ; save r4, r5 and lr
movs r4, #123 ; do stuff
movs r5, #42
pop {r4, r5, computer} ; restore the values to r4, r5 and *computer*!
Gadgets
Not a lot is required to make a circuit appear like a pc. For starters, you could wish to begin with:
- a keyboard (studying uncooked character enter);
- a terminal show (displaying characters, like a terminal emulator);
- a video show (displaying uncooked pixel information);
- a random quantity generator;
- a decimal 7-segment show;
- a community card (that may obtain and transmit information by way of TCP).
All of those are seen as addresses in reminiscence by the CPU and applications operating on it. For instance, writing a byte to handle 0xFFFFFF00
will show a personality within the terminal show. Studying a byte from tackle 0xFFFFFF18
will inform whether or not the keyboard buffer is empty or not.
The only strategy to run code on this factor is to easily write machine code and cargo it into the ROM.
This is a easy program:
movs r0, #255 ; r0 = 255 (0x000000FF)
mvns r0, r0 ; r0 = ~r0 (0xFFFFFF00, tackle of terminal)
movs r1, #65 ; r1 = 65 (ASCII code of 'A')
str r1, [r0] ; *r0 = r1
It is assembled as 20ff 43c0 2141 6001
(8 bytes), and when loaded and run, it reveals this after 4 cycles:
In fact, writing applications in meeting is not precisely sensible. We invented macro assemblers and high-level (in comparison with meeting) programming languages for this a very long time in the past, so let’s try this right here. I initially went with C, however shortly switched to Rust for the benefit of use and highly effective macro help (helpful for a constrained surroundings like this one).
Rust (technically, the reference compiler, rustc) makes use of LLVM as a backend for compilation, so any goal LLVM helps, Rust helps it to some extent. Right here, I am utilizing the builtin goal thumbv6m-none-eabi
(ARM v6-M Thumb, no vendor or OS, embedded ABI), however there is a massive constraint: my CPU shouldn’t be a full ARM CPU.
Since not all directions are supported (some are emulated by my home made assembler), I can not simply construct ARM binaries and cargo them. I would like to make use of my very own assembler, so I am calling the compiler instantly and telling it to emit uncooked meeting code, which is then despatched to my assembler that lastly generates a loadable binary file.
Moreover, since I am operating code with out an OS, with none exterior code, I can not use Rust’s normal library. This can be a completely supported use case (known as no_std
), and doesn’t suggest I can not use something in any respect: it solely signifies that as a substitute of utilizing the std
crate (the same old normal library), I exploit the core
crate that incorporates solely the naked requirements and particularly doesn’t rely on an working system operating beneath. The core
crate nevertheless doesn’t embrace something that depends on heap allocations (resembling String
or Vec
), these are discovered within the alloc
crate that I do not use both for numerous advanced causes associated to my construct system.
Mainly, I wrote my very own normal library. I am now capable of write applications like:
fn predominant() {
println!("Hi there, world!");
display::circle(10, 10, 20, ColorSimple::Crimson);
let x = fp32::from(0.75);
let mut video = display::tty::clean().offset(50, 50);
println!("sin(", Blue.fg(), x, Black.fg(), ") = ", Inexperienced.fg(), x.sin(), => &mut video);
}
and get:
Pitfalls
Utilizing rustc’s uncooked meeting output signifies that I can not depend on code from different crates than the one I am constructing (that will require utilizing the linker, which I’m not utilizing right here). I can not even use the compiler’s intrinsics: capabilities resembling memcpy
or memclr
are sometimes used to carry out block copies, however they don’t seem to be current within the generated meeting, so I needed to implement them myself (I borrowed some code from Redox right here).
One other drawback is that since I’m emulating some directions (by translating them into sequence of different, supported directions), department offsets can get greater than what the compiler anticipated. Drawback: conditional branches on Thumb take an 8-bit signed instant, so in case you attempt to leap greater than 128 directions forward or behind, you’ll be able to’t encode that instruction.
In apply, which means that I typically need to extract code blocks from capabilities to make them smaller, and that the entire codebase is sprinkled with #[inline(never)]
to pressure the compiler to maintain these blocks in separate capabilities.
Implementing a useable normal library shouldn’t be the simplest job; guaranteeing that the entire thing is ergonomic and nice to make use of is even tougher. I had to make use of many unstable (nightly-only) options, resembling GATs, related sort defaults, and specializations, amongst others.
This complete challenge comforts my alternative to make use of Rust for future low-level/embedded growth initiatives. Doing a hundredth of this in C would have been orders of magnitude tougher and the code would not have been wherever as readable as this one is.
Showcase
Plotter
This plotter makes use of the fixed-point (16.16) numeric library and the video show.
Trigonometric capabilities had been applied utilizing Taylor sequence (I do know, CORDIC, however I like ache).
BASIC interpreter
This can be a easy BASIC interpreter / REPL, just like what was discovered on dwelling computer systems of the 80s (e.g. C64). You’ll be able to enter applications line by line, show them, and run them. The supported directions are PRINT
, INPUT
, CLS
, GOTO
and LET
. The immediate helps LIST
, RUN
, LOAD
, ASM
and ASMRUN
.
Packages may also be loaded over the community (just like cassette loading on the C64), with a program resembling:
cat $1 <(echo) | # learn file, add newline
dos2unix | # convert line ends to Unix (LF)
nc -N ::1 4567 # ship to port
Right here, LOAD
begins listening and the acquired traces are displayed with a #
prefix and browse as in the event that they had been being typed by the consumer:
Packages may also be compiled to Thumb meeting for greater efficiency:
In case you are questioning, here is how that half works:
- every BASIC instruction and expression is transformed (compiled) to a sequence of meeting directions, for instance
CLS
merely shops 12 (f
) within the tackle of the terminal output, whereas an expression merely units up a bit stack machine that evaluates the system and shops the end inr1
, and aLET
instruction evaluates its expression and shopsr1
into the variable’s reminiscence cell - as soon as this system has been compiled, it is executed like a operate, like this:
let ptr = directions.as_ptr();
let as_fn: extern "C" fn() -> () = unsafe { core::mem::transmute(ptr) };
as_fn();
- a
bx lr
instruction is appended on the finish of the instruction checklist, so when this system finishes, it offers the management again to the interpreter
Web server
This program listens for HTTP requests, parses them, and processes them.
Scheme
That is an REPL for a small however useable sufficient subset of R6RS. It helps most essential primitive varieties, many builtins and macros.
Supported datatypes are symbols, integers, booleans, strings, lists (internally saved as vectors), void and procedures (both builtin operate or user-defined closure).
As with the BASIC interpreter, applications may be loaded over the community:
Terminal emulator
This can be a easy terminal emulator that helps a subset of the ANSI (VT) escape codes (sufficient to show fairly colours and transfer the cursor round).
It makes use of a 5×7 font borrowed from here, and the ANSI decoding logic was written by hand (there are crates on the market that just do that, however they help all of the ANSI codes, whereas I solely wanted a really small subset for this program).
A script resembling this one can be utilized to begin a shell and pipe it to the circuit:
exec 3<>/dev/tcp/127.0.0.1/4567
cd ~
unbuffer -p sh -c 'stty echo -onlcr cols 80 rows 30 erase ^H;sh' <&3 1>&3 2>&3
MIDI player
Digital offers a MIDI output part, that helps urgent or releasing a key for a given instrument, so I wrote a easy program that makes use of midly to parse a MIDI file despatched over community after which decodes helpful messages to play the tune.
Since channels are saved sequentially in a MIDI file, and since I solely have one output anyway, I wrote an algorithm that merges channels collectively right into a single message checklist. Occasions are already saved chronologically, so that is merely a “merge okay sorted arrays” drawback, that may be solved by recursively merging halves of the array (conventional divide-and-conquer method) in (O(n okay log okay)) ((okay) arrays of (n) objects).
This is the end result:
All in all, this was enjoyable. ARM/Thumb is an efficient structure to implement as a aspect challenge, because it’s well-supported by compilers and a sufficiently small subset of it may suffice to run attention-grabbing code. Logisim and Digital are each wonderful instruments for digital circuit simulation. Rust is good for doing issues and making stuff.
Repository obtainable here. Do not have a look at the commit messages.