16-bit Serial Homebrew CPU – 2023
![](https://blinkingrobots.com/wp-content/uploads/2023/08/16-bit-Serial-Homebrew-CPU-2023-1600x800.jpg)
Building a homebrew CPU from scratch takes a lot of logic chips. It’s understandable that implementing registers, a program counter, an ALU, and the other parts of the CPU in TTL or CMOS logic requires a considerable number of chips. But how many exactly?
I’ve tried to optimize my homebrew CPU for the lowest number of logic chips possible and answer a question:
How few ICs are required for a Turing-complete CPU with no CPU?
My answer is a 16-bit serial CPU with only 8 ICs, including memory and clock. It has 128kB SRAM, 768kB FLASH, and can be clocked up to 10MHz. It contains only a 1-bit ALU, but the majority of its 52 instructions operate on 16-bit values (serially). At its maximum speed, it executes roughly 12k instructions per second (0.012MIPS) and, among other things, is capable of streaming a video on a PCD8544-based (Nokia 5110) LCD at ~10 FPS.
Depending on where you draw the line between a state machine and a CPU, my 16-bit system might actually be the CPU with the lowest number of ICs. Some other contestants, though, are Jeff Laughton’s 1-bit computer with 1 instruction and 1 bit of memory, and Daniel Thornburgh’s Simple CPU with 1 byte-byte-jump instruction and memory simulated on a Raspberry Pi.
Hardware:
The architecture is inspired by other CPU builds like James Sharman’s JAM-1, Ben Eater’s SAP-1, Warren’s 4-bit Crazy Small CPU, its 8-bit version, and others. All of them, and many others alike, use a “control” EEPROM, EPROM, or ROM for generating control signals to the CPU components. Because it’s way easier than generating them by logic circuits alone, and because it provides more flexibility in the future, I’ve also decided to use such a “control” memory, specifically an EPROM. Contrary to the builds mentioned above, I’ve aimed for the lowest possible chip count, so I’ve tried to “squeeze” as much data processing inside the memory as possible, to either lower the demands on other CPU components or, better yet, eliminate them completely. Here are some key steps taken:
- Completely eliminating the ALU and implementing it as a lookup. Because most EPROMs have only an 8-bit output and the system also needs other control signals, the data width of the ALU has to be drastically limited. Not to worry, it can be reduced all the way down to a single bit: 1-bit computing is actually all we need.
- To get any meaningful computation done, the output from the 1-bit ALU needs to be serialized. That is a good use case for a serial SRAM, which also brings other benefits. First, it eliminates the need for registers, since all ALU operations can be performed directly on the data in SRAM. Second, serial SRAMs are also addressed serially, so there is no need to latch the source and destination addresses. Third, an arbitrary data processing width can be achieved simply by selecting the number of SRAM clock cycles. I chose 16 bits (16 SRAM clock cycles per 1 ALU operation) as a nice compromise between utility and speed.
- At least 2 serial SRAM chips are required: one of them has to provide a serialized input to our 1-bit ALU and, at the same time, the second has to store the result.
- For ALU operations with 2 operands (like ADD/AND/XOR…), 2 serialized inputs are needed. Adding a third SRAM would certainly be an option (2 for ALU inputs, 1 for result), but there is a better solution. If a serial FLASH memory is used instead of an SRAM, the same benefits remain (already serialized data, serialized address), but the FLASH can be used for storing the instructions/program as well as providing the ALU input.
- It’s unnecessary to add any hardware for a program counter, as there is already plenty of space inside the SRAMs where its value can be stored.
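To make the idea of a 1-bit lookup ALU working serially on 16-bit values more concrete, here is a minimal Python sketch. It is only an illustration: the `FULL_ADDER` table and the `serial_add` function are hypothetical names, and the real CPU keeps this logic inside the control EPROM together with the rest of the state machine.

```python
# 1-bit full adder as a lookup table: (a, b, carry_in) -> (sum, carry_out).
# Illustration only; in the real build this lookup is part of the EPROM contents.
FULL_ADDER = {
    (a, b, c): (a ^ b ^ c, (a & b) | (c & (a ^ b)))
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}

def serial_add(x, y, width=16):
    """Add two integers one bit per step, LSB first, like a 1-bit ALU."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1          # bit streamed from the first operand
        b = (y >> i) & 1          # bit streamed from the second operand
        s, carry = FULL_ADDER[(a, b, carry)]
        result |= s << i          # bit streamed into the destination
    return result, carry          # the final carry acts as the single Flag bit

print(serial_add(0x1234, 0x0FF0))
print(serial_add(0xFFFF, 0x0001))
```

The only state carried between steps is one carry bit, which is why a single Flag bit is enough for 16-bit (or wider) arithmetic.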
Even with these dramatic simplifications, some additional hardware is still required. Nonetheless, everything can be built with just 8 chips in total, following the schematic below:
The circuit is built around a 128kB M27C1001-15 EPROM, running at 5V and combining a control state machine with a 1-bit ALU. Its output lines are latched by a 74HC574 every clock cycle and control the two 23LCV512 64kB serial SRAMs and one W25Q80 1MB serial FLASH. There are not enough outputs to control each memory individually, so they share the data line and partially also the chip select line; only the clock lines are kept separate. I couldn’t find a 5V serial FLASH memory, so resistors R3, R4, and R5 limit the current and form a bridge from 5V to 3.3V. I don’t count the MCP1703 3.3V voltage regulator as part of the CPU (I consider it to be part of a power supply), but with it, my CPU contains 9 chips.
The current instruction is stored in a buffered shift register 74HC595, which has its control lines also partially shared with the memories. Every instruction takes multiple clock cycles to complete, so the progress within an instruction is tracked by a “microcode” counter 74HC393. After the instruction completes, the “Counter_reset” line resets the “microcode” counter and starts the execution of the next instruction buffered in the 74HC595.
The 74HC574 and the “microcode” counter 74HC393 use opposite clock edges, so the clock generator 74HC14 provides an inverted clock signal to the 74HC393 to make them both synchronous.
Inputs and Outputs:
One thing I was not able to reasonably implement in my CPU is self-programming of the FLASH memory. A bootloader is, therefore, not possible, and uploading a new program to the serial FLASH must be done externally. For this purpose, I used an Attiny13 microcontroller that listens to a set of commands over a UART, so any USB-UART adapter is sufficient for uploading new code. When programming, it disables the output of the 74HC574 via the “Prog_en” line and proceeds to program the FLASH memory directly. The microcontroller is used only for uploading a new program, and the CPU runs happily without it.
The only available outputs are the two upper bits of the instruction shift register 74HC595. I used one of these inverted lines as a chip select, which allows the CPU to connect to an SPI-like device. For example, a 3.3V PCD8544-based SPI LCD (Nokia 5110) can be connected directly, with the second upper instruction bit acting as the LCD data/command selector.
Another option is connecting an additional 74HC595 shift register instead of an LCD to get general digital output lines.
The only available inputs are the two memory data/input signals connected to the EPROM address lines (A9, A11). The serial memories keep these signals at high impedance when they are not in use, so they can be sampled as general digital inputs when the memories are idle. It is important to note that the input signal must not interfere with the memory data, so a high resistance between the input signal and the memory input line is required (R6, R7). Sidenote: reading the input signal on the memory data lines works only for clock frequencies up to about 8MHz. At higher frequencies, the sampled data seem to be erratic and the CPU is prone to stalling.
You could already see my CPU playing the “Bad Apple!!” music video on a PCD8544 LCD somewhere at the top of this page. In the video below, I demonstrate the possibility of controlling general digital outputs by adding another 74HC595. The same circuit can be used to produce 8-bit music at up to 4300 samples per second if an R-2R ladder is used instead of LEDs, and it is the same circuit I used to produce the soundtrack for the “Bad Apple!!” video.
Memory map:
The CPU has no dedicated registers, but it has two SRAMs that it can read from and write to. The downside is that every time the CPU wants to access data, it has to write the full 16-bit address to the serial SRAM. The upside is that because it has to write the full 16-bit address anyway, the CPU (and instructions in general) can access the entire 64kB of the SRAM in constant time.
I’ve chosen one SRAM (U8/RAM1) to be used for holding the program data, and all arithmetic and logical operations are designed to be performed on values inside this memory. The second SRAM (U7/RAM2) is meant to be used for a stack, so only a few instructions are able to access and modify its contents.
The first few bytes of both memories are reserved for storing the internal CPU state like the program counter, the flag bit, the stack pointer, an intermediate result, the source/destination addresses, and other internally used values. The approximate memory map is in the table below:
Address: | 0x0 | 0x1 | 0x2 | 0x3 | 0x4 | 0x5 | 0x6 | 0x7 | 0x8 | 0x9 | 0xA | 0xB | 0xC | 0xD | 0x000E~0xFFFF |
RAM1: | Flag & Input | Program counter (PC) | Program counter reversed | Stack pointer (SP) | Stack value (SPVAL) | Registers and user data | |||||||||
RAM2: | Flag | Program counter (PC) | Destination address | Instruction’s result | Stack and user data |
One thing I want to mention is the method of using the FLASH memory as the second ALU input. Because the FLASH is quite large (1MB), it is possible to fit inside it a full 16-bit lookup table in which every 16-bit entry is identical to its own index. With this 128kB lookup, it is then possible to write a 16-bit value to the FLASH as an address and read back the same 16-bit value as data, which can be used as an ALU input.
A slight inconvenience of using the serial memories is that they are addressed in an MSB-first format, whereas the 1-bit ALU naturally computes in an LSB-first format. To get functional memory addressing, we need to reverse the bits from the LSB-first format the CPU works with to the MSB-first format the memories work with. Reversing bits using a 1-bit ALU is not easy, so I’ve reserved another 128kB of FLASH memory for a “reversed-values” lookup table to make the operation faster. It works the same way as the previous lookup: a value is written to the FLASH memory as an address and its reversed representation is read back as data.
These two 16-bit lookup tables (2 × 128kB = 256kB = 0x040000 bytes) are the reason my CPU has only 768kB of FLASH memory and why the program counter (PC) starts at address 0x040000 and not zero.
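As a rough illustration of the two lookup tables, the Python sketch below generates them and checks the sizes quoted above. The helper names (`reverse16`, `build_table`), the byte order of the entries, and the assumption that the entry for value `v` sits at byte offset `2 * v` are all hypothetical; the actual layout in the author's FLASH image may differ.

```python
def reverse16(v):
    """Reverse the bit order of a 16-bit value (LSB-first <-> MSB-first)."""
    r = 0
    for i in range(16):
        r |= ((v >> i) & 1) << (15 - i)
    return r

def build_table(fn):
    """One 16-bit entry per possible 16-bit input: 65536 * 2 bytes = 128 kB."""
    out = bytearray()
    for v in range(0x10000):
        out += fn(v).to_bytes(2, "big")   # byte order is an assumption
    return bytes(out)

identity_table = build_table(lambda v: v)   # read back the value written as address
reversed_table = build_table(reverse16)     # read back the bit-reversed value

TABLES_SIZE = len(identity_table) + len(reversed_table)
FLASH_SIZE = 1024 * 1024                    # W25Q80: 1 MB

print(hex(TABLES_SIZE))                     # first free FLASH address after the tables
print((FLASH_SIZE - TABLES_SIZE) // 1024)   # kB left for the program
```

Under these assumptions the two tables end exactly at 0x040000, which matches the initial program counter value, and 768kB remain for code.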
Instruction set:
There are some restrictions for the instruction set arising from the limited hardware. The CPU is capable of only 64 unique instructions/operations, all of which have to fit in the maximum of 256 micro-instruction steps and must get by working with only a 1-bit ALU and 1 Flag bit. But even with these limitations, surprisingly, it is possible to create quite a nice-looking instruction set:
OP code | Name | Operands | Width | Flag | Cycles | Total | Description |
0x00 | INIT | – | – | clear | 256 | 256 | Wait for clock to stabilize, then initialize RAM ICs to sequential mode |
0x01 | RESET | – | – | clear | 235 | 235 | Set program counter PC = 0x040000 and stack pointer SP = 0x000A |
0x02 | – | – | – | – | 158 | 414 | Shadow instruction: Fetch |
0x03 | – | – | – | – | 256 | 414 | Shadow instruction: Fetch continuation |
0x04 | – | – | – | – | 129 | 129 | Shadow instruction: Increment program counter PC = PC + 3 |
0x05 | – | – | – | – | 129 | 129 | Shadow instruction: Increment program counter PC = PC + 5 |
0x06 | – | – | – | – | 129 | 129 | Shadow instruction: Increment program counter PC = PC + 7 |
0x07 | – | – | – | – | 129 | 129 | Shadow instruction: Increment program counter PC = PC + 8 |
0x08 | – | – | – | – | 162 | 291 | Shadow instruction: Copy 32 bit result |
0x09 | – | – | – | – | 130 | 259 | Shadow instruction: Copy 16 bit result |
0x0A | – | – | – | – | 113 | 113 | Shadow instruction: Copy program counter |
0x0B | – | – | – | – | 167 | 296 | Shadow instruction: Store to RAM indirect |
0x0C | – | – | – | – | 151 | 280 | Shadow instruction: Store to RAM indirect |
0x0D | – | – | – | – | 173 | 587 | Shadow instruction: Arithmetic instruction dispatch |
0x0E | STF | – | – | set | 132 | 546 | Set FLAG |
0x0F | CLF | – | – | clear | 132 | 546 | Clear FLAG |
0x10 | NOP | – | – | – | 132 | 546 | No operation |
0x11 | MOV | addr16 <- addr16 | 16 | – | 231 | 774 | Move 16 bit value |
0x12 | MOVW | addr16 <- addr16 | 32 | – | 146 | 851 | Move 32 bit value |
0x13 | INC | addr16 <- addr16 | 16 | overflow | 231 | 774 | Increment |
0x14 | DEC | addr16 <- addr16 | 16 | overflow | 231 | 774 | Decrement |
0x15 | COM | addr16 <- addr16 | 16 | zero | 231 | 774 | 1’s complement (NOT) |
0x16 | NEG | addr16 <- addr16 | 16 | zero | 231 | 774 | 2’s complement |
0x17 | LSL | addr16 <- addr16 | 16 | overflow | 233 | 776 | Left shift (<<) |
0x18 | LSR | addr16 <- addr16 | 16 | overflow | 233 | 776 | Right shift (>>) |
0x19 | ROL | addr16 <- addr16 | 16 | overflow | 233 | 776 | Left shift with carry |
0x1A | ROR | addr16 <- addr16 | 16 | overflow | 255 | 798 | Right shift with carry |
0x1B | ASR | addr16 <- addr16 | 16 | overflow | 235 | 778 | Arithmetic right shift (keeps sign bit) |
0x1C | REV | addr16 <- addr16 | 16 | – | 238 | 781 | Bit reverse |
0x1D | ADDI | addr16 <- addr16, val16 | 16 | overflow | 231 | 774 | Add immediate |
0x1E | ADCI | addr16 <- addr16, val16 | 16 | overflow | 231 | 774 | Add immediate with carry |
0x1F | SUBI | addr16 <- addr16, val16 | 16 | overflow | 231 | 774 | Subtract immediate |
0x20 | SBCI | addr16 <- addr16, val16 | 16 | overflow | 231 | 774 | Subtract immediate with carry |
0x21 | ANDI | addr16 <- addr16, val16 | 16 | zero | 231 | 774 | Logical AND with immediate |
0x22 | ORI | addr16 <- addr16, val16 | 16 | zero | 231 | 774 | Logical OR with immediate |
0x23 | XORI | addr16 <- addr16, val16 | 16 | zero | 231 | 774 | Logical XOR with immediate |
0x24 | ADD | addr16 <- addr16, addr16 | 16 | overflow | 171 | 887 | Add register |
0x25 | ADC | addr16 <- addr16, addr16 | 16 | overflow | 171 | 887 | Add register with carry |
0x26 | SUB | addr16 <- addr16, addr16 | 16 | overflow | 171 | 887 | Subtract register |
0x27 | SBC | addr16 <- addr16, addr16 | 16 | overflow | 171 | 887 | Subtract register with carry |
0x28 | AND | addr16 <- addr16, addr16 | 16 | zero | 171 | 887 | Logical AND with register |
0x29 | OR | addr16 <- addr16, addr16 | 16 | zero | 171 | 887 | Logical OR with register |
0x2A | XOR | addr16 <- addr16, addr16 | 16 | zero | 171 | 887 | Logical XOR with register |
0x2B | JMP | addr24 | – | – | 197 | 611 | Jump to address |
0x2C | CALL | addr24 | 32 | – | 221 | 748 | Copy following instruction’s address (PC + 4) and current FLAG to SPVAL, then jump |
0x2D | RET | – | 32 | restore | 138 | 552 | Move SPVAL to PC & FLAG (effectively returns from CALL and restores previous FLAG) |
0x2E | BRFS | addr24 | – | – | 160 | 625 / 574 | Branch if FLAG set |
0x2F | BRFC | addr24 | – | – | 160 | 625 / 574 | Branch if FLAG cleared |
0x30 | BREQ | addr16, addr24 | 16 | – | 243 | 708 / 657 | Branch if register is zero |
0x31 | BRNE | addr16, addr24 | 16 | – | 243 | 708 / 657 | Branch if register is not zero |
0x32 | LDI | addr16 <- value16 | 16 | – | 81 | 624 | Load 16 bit immediate |
0x33 | LDIW | addr16 <- value32 | 32 | – | 113 | 656 | Load 32 bit immediate |
0x34 | LD | addr16 <- [addr16] | 16 | – | 238 | 911 | Indirect load 16 bits from address |
0x35 | LDB | addr16 <- [addr16] | 8 | – | 238 | 911 | Indirect load 8 bits from address, set upper 8 bits to 0 |
0x36 | ST | [addr16] <- addr16 | 16 | – | 163 | 873 | Indirect store 16 bits to address |
0x37 | STB | [addr16] <- addr16 | 8 | – | 163 | 857 | Indirect store 8 bits to address |
0x38 | LD2W | [addr16] | 32 | – | 256 | 799 | Indirect load 32 bits from address in RAM2 to SPVAL register |
0x39 | LD2 | [addr16] | 16 | – | 224 | 767 | Indirect load 16 bits from address in RAM2 to SPVAL register |
0x3A | ST2W | [addr16] | 32 | – | 256 | 799 | Indirect store 32 bits from SPVAL register to address in RAM2 |
0x3B | ST2 | [addr16] | 16 | – | 224 | 767 | Indirect store 16 bits from SPVAL register to address in RAM2 |
0x3C | LPM | addr16 <- [addr16] | 16 | – | 211 | 884 | Indirect load 16 bits from address in FLASH |
0x3D | LPB | addr16 <- [addr16] | 8 | – | 211 | 884 | Indirect load 8 bits from address in FLASH, set upper 8 bits to 0 |
0x3E | OUT | addr16 | 8 | – | 252 | 795 | Output 8 bits over SPI |
0x3F | HALT | – | – | clear | 14 | 428 | Stop execution |
The first instructions, INIT and RESET, are executed at power-up or when the RESET button is pressed. The “shadow” instructions are non-user-accessible instructions, mainly used for repeated operations like fetching an instruction, incrementing the program counter, writing back a result, and similar.
Arithmetic and logical operations use the single Flag bit as either a Carry/Overflow or a Zero flag. As mentioned above, there is no performance penalty for accessing the full address space, so all these instructions can specify any source/destination address within the 64kB SRAM address space. Indirect addressing for arithmetic operations is not supported directly but must be done with LD/ST (load/store) instructions.
The second set of LD2/ST2 instructions accesses the second SRAM. They are meant to be used for the stack, but any data can be stored. PUSH and POP instructions are not implemented, but they can be built from LD2/ST2 and INC/DEC instructions.
An average instruction takes about 800 clock cycles, including the fetch operation and program counter increment. At the maximum clock frequency of 10MHz, the CPU can execute about 12k instructions per second.
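The throughput figure follows directly from the cycle counts. The short Python sketch below redoes the arithmetic; the constant names are hypothetical, and the per-instruction totals are copied from the “Total” column of the table above.

```python
CLOCK_HZ = 10_000_000   # maximum stable clock quoted in the article
AVG_CYCLES = 800        # approximate average, including fetch and PC increment

ips = CLOCK_HZ / AVG_CYCLES
print(f"{ips:.0f} instructions/s = {ips / 1e6:.4f} MIPS")

# Total cycle counts of a few instructions, taken from the instruction table.
totals = {"NOP": 546, "MOV": 774, "ADD": 887, "JMP": 611, "LD": 911}
for name, cycles in totals.items():
    print(f"{name}: about {CLOCK_HZ / cycles:.0f} per second")
```

With the rounded average of 800 cycles this gives 12,500 instructions per second, in line with the “about 12k” (0.012 MIPS) figure.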
Writing code in assembler:
I use Lorenzi’s customasm tool to generate binary files from assembly source code. The binary files can then be uploaded using a small python3 utility to the Attiny13 programming microcontroller that writes the binary into the FLASH.
Below are two examples of small subroutines written in assembler for my CPU. The first subroutine returns the 32-bit result of multiplying two 16-bit values. The second writes an ASCII string stored inside the FLASH memory to the LCD.

Multiply32_16x16:

    ; Returns FA32 = FA16 * FB16
    ; FB is expected to be smaller
    Multiply32_16x16:
        ;PUSH_PC                ; Not necessary
        LDIW FC, 0              ; Clear result
        LDI  FA+2, 0            ; Cast FA16 to FA32
    .loop:
        ANDI TMP, FB, 1
        BRFS .skip_add
        ADD  FC, FA             ; Add FC32 += FA32
        ADC  FC+2, FA+2         ; Add FC32 += FA32
    .skip_add:
        LSL  FA                 ; Shift FA32 << 1
        ROL  FA+2               ; Shift FA32 << 1
        LSR  FB                 ; Shift FB16 >> 1
        BRNE FB, .loop
        MOVW FA, FC             ; Copy result
        ;POP_PC                 ; Not necessary
        RET

LCD_WriteStrF:

    ; Write String in Flash
    ; input: FA32 <- Address of the string in Flash
    LCD_WriteStrF:
        PUSH_PC                 ; Save return address
        PUSHW RA                ; Save RA 32-bit
        MOVW RA, FA
    .loop:
        LPB  FA, RA             ; Load character from Flash
        BREQ FA, .cease         ; Test "\0" character
        REV  FA                 ; MSB-first -> LSB-first
        ANDI FA, FA+1, 0xFF     ; Cast to 8bits
        CALL LCD_WriteChar      ; Write character
        ADDI RA, 1              ; Increase 32-bit pointer
        ADCI RA+2, 0            ; Increase 32-bit pointer
        JMP  .loop
    .cease:
        POPW RA                 ; Restore RA 32-bit
        POP_PC                  ; Restore return address
        RET
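For readers who prefer a high-level view, the multiplication subroutine is a classic shift-and-add loop. The Python sketch below mirrors it step by step; the function name is hypothetical, the comments map each line to the assembly as I read it, and it is a behavioural model rather than a cycle-accurate simulation.

```python
def multiply32_16x16(fa, fb):
    """Shift-and-add multiply: 32-bit product of two 16-bit values."""
    fa &= 0xFFFF                        # LDI FA+2, 0: widen FA16 to FA32
    fb &= 0xFFFF
    fc = 0                              # LDIW FC, 0: clear result
    while True:
        if fb & 1:                      # ANDI TMP, FB, 1 / BRFS .skip_add
            fc = (fc + fa) & 0xFFFFFFFF # ADD + ADC: FC32 += FA32
        fa = (fa << 1) & 0xFFFFFFFF     # LSL + ROL: FA32 <<= 1
        fb >>= 1                        # LSR: FB16 >>= 1
        if fb == 0:                     # BRNE FB, .loop
            break
    return fc                           # MOVW FA, FC: copy result

print(multiply32_16x16(1234, 5678))
```

The loop ends as soon as FB has been shifted down to zero, which is presumably why the comment recommends passing the smaller operand in FB: fewer iterations are needed.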
Maximum Frequency and the Critical path:
According to the specifications, the total propagation delay across the critical path is:
- 12ns at 74HC14 from “Clock_pos” to “Clock_neg”,
- 54ns at 74HC393 to ripple to the last, eighth bit (12+3×5+12+3×5 ns),
- 150ns access time at M27C1001-15 EPROM,
- 2ns at 74HC574 to stabilize the inputs before the clock edge.
Putting it together (218ns in total), the circuit should only be able to run at ~4.6MHz. My particular build, however, is able to work flawlessly up to 10MHz and becomes unstable only above ~10.5MHz. For a circuit built on a breadboard with plenty of parasitic capacitance, I consider it quite impressive. The maximum clock rate might even be improved if a faster binary counter or faster EPROM were used.
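The ~4.6MHz estimate is simply the reciprocal of the summed delays. A minimal Python check, using only the numbers listed above (the dictionary keys are descriptive labels, not datasheet parameter names):

```python
# Worst-case delays along the critical path, in nanoseconds (from the list above).
delays_ns = {
    "74HC14 Clock_pos -> Clock_neg": 12,
    "74HC393 ripple to the 8th bit": 12 + 3 * 5 + 12 + 3 * 5,
    "M27C1001-15 access time": 150,
    "74HC574 input setup": 2,
}

total_ns = sum(delays_ns.values())
f_max_mhz = 1000 / total_ns   # 1000 ns per microsecond, so 1000 / ns gives MHz

print(f"critical path: {total_ns} ns, f_max = {f_max_mhz:.2f} MHz")
```

The EPROM access time accounts for roughly two thirds of the total, so a faster memory would have the largest effect on the specified maximum clock.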
Conclusion and Retrospective:
I’m really pleased with the finished CPU. It has a nice and “easy-to-work-with” instruction set with all the basic instructions present. It’s powerful enough to stream a video on a small LCD screen, play audio (though using an external “sound card”), and generally perform the simple input/output computational operations it was originally intended for. Finally, it successfully demonstrates that it is possible to build a functional homebrew CPU with only a handful of ICs.
There are, however, some small improvements possible:
- The 74HC393 ripple counter is a significant bottleneck in the critical path. Replacing it with a carry-lookahead adder (“fast adder”) or with a buffered counter like the 74HC590 would increase the maximum clock speed.
- The same goes for the M27C1001-15 EPROM. Using a faster memory like the M27C1001-35 EPROM or SST39SF020A-70 FLASH would also allow a higher clock rate.
- A larger EPROM with more than 17 address lines could be used to either increase the instruction count or to utilize the additional address lines as general digital inputs.
- Adding some instructions for erasing and programming the internal FLASH memory would have enabled a bootloader to be made, which would make the Attiny13 programming circuit unnecessary.
- The system is able to execute code only from the FLASH memory. It would be possible to create an emulator inside the FLASH and make the emulator execute code from SRAM, but to make the CPU execute code from SRAM natively, a different instruction fetching process would be required, possibly including a duplicate set of instructions just for the SRAM execution itself.
I have yet to see whether some of these improvements are worthwhile to implement. In the meantime, if you like the project and want to dive deeper, you can skim through the source code available here. It contains a simulator, EPROM microcode generator, Attiny13 programmer firmware, and all of my assembler code.
Update:
I’ve implemented a minimalistic 3D wireframe object projection engine using 16-bit fixed-point arithmetic. Multiplying matrices on my 0.012 MIPS CPU is quite slow, so 3D games are probably not coming anytime soon:
I’m also slowly growing the list of hardware my CPU directly supports. I’ve added an SPI alphanumeric LCD I’ve salvaged from an old HP printer:
and I’ve been able to “bit-bang” the serial interface for the DS1302 real-time clock. The software does have to use some special instruction sequences to produce the required signals, but it is possible and doesn’t require any additional hardware.
Update 2:
The CPU now supports a PCF8833 LCD driver, although one frame takes about 96 seconds to render.
Homebuilt CPUs WebRing
Definitely check out other awesome homebrew CPU builds on Warren’s https://www.homebrewcpuring.org
Copyright © 2023 Jiri Stepanovsky. All Rights Reserved.