Bare metal C on my RISC-V toy CPU · Florian Noeding’s blog
I always wanted to know how a CPU works, how it transitions from one instruction to the next and makes a computer work. So after reading Ken Shirriff’s blog about a bug fix in the 8086 processor I thought: well, let’s try to write one in a hardware description language. This post is a write-up of my learning experiment.
I’ll walk through my steps of creating an emulator, compiling and linking C for bare metal, CPU design and finally the implementation of my toy RISC-V CPU. The goals:
- implement a CPU in a hardware description language (HDL),
- the code must be synthesizable (except memory),
- simulate it,
- and run a bare metal C program on it.
While I had plenty of research time, I had only about 30 hours of development time. Without prior hardware design experience the goals had to be simple enough:
- RISC-V integer instruction set only (minus system and break calls),
- no interrupt handling or other complexities,
- no or only minimal optimizations.
My test program written in C is Conway’s Game of Life, try the interactive web version.
In a future iteration I plan to run my project on a real FPGA and implement a memory controller. Let’s say I made it to about 85% of what I wanted to achieve.
Writing an emulator, i.e. a program that can execute the instructions of a CPU, is an excellent stepping stone towards a hardware implementation. For me – without a hardware background – it’s much easier to reason about and learn the instruction set.
So my first step was to understand the RISC-V instruction set. The RISC-V specification is quite long, but I only needed chapters 2 (integer instruction set), 19 (RV32/64G instruction set listings) and 20 (assembly programmer’s handbook). These give detailed definitions of how each instruction must be executed, what kind of registers must be implemented, and so on.
Let’s look at an example: arithmetic / logical instructions operating on a register and an immediate value. For ADDI (add immediate) the following is done: rd <- rs1 + imm: the register identified by rd is set to the sum of the value stored in register rs1 and the value given in the instruction itself.
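Sketched in plain Python (a toy model for illustration, not the emulator’s actual code), ADDI works like this; the encoding 0x02A00093 is `addi x1, x0, 42` from my CPU test program further below:

```python
def execute_addi(reg, instr):
    """Toy model of ADDI: rd <- rs1 + sign-extended 12-bit immediate."""
    rd = (instr >> 7) & 0x1F
    rs1 = (instr >> 15) & 0x1F
    imm = (instr >> 20) & 0xFFF
    if imm & 0x800:                 # sign-extend the 12-bit immediate
        imm -= 0x1000
    if rd != 0:                     # x0 is hard-wired to zero
        reg[rd] = (reg[rs1] + imm) & 0xFFFFFFFF

reg = [0] * 32
execute_addi(reg, 0x02A00093)       # addi x1, x0, 42
assert reg[1] == 42
```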
The emulator is implemented in C++ with a C-style interface in a shared library. This makes it easy to hook it into Python via cffi. This saved me quite a bit of time for file system and user interactions, which were all done in Python.
struct RV32EmuState;
extern "C" RV32EmuState* rv32emu_init();
extern "C" void rv32emu_free(RV32EmuState *state);
extern "C" void rv32emu_set_rom(RV32EmuState *state, void *data, uint32_t size);
extern "C" void rv32emu_read(RV32EmuState *state, uint32_t addr, uint32_t size, void *p);
extern "C" int32_t rv32emu_step(RV32EmuState *state);
extern "C" int32_t rv32emu_run(RV32EmuState *state, uint32_t *breakpoints, uint32_t num_breakpoints);
extern "C" void rv32emu_print(RV32EmuState *state);
The core of the emulator is the rv32emu_step function, which executes exactly one instruction and then returns. In RISC-V each instruction is exactly 32 bits. It first decodes the opcode (what kind of operation) and then the exact operation (e.g. ADD immediate). It’s a big, nested switch.
int32_t rv32emu_step(RV32EmuState *state) {
    uint32_t *instr_p = (uint32_t*)(_get_pointer(state, state->pc));
    uint32_t instr = *instr_p;

    // decode
    uint32_t opcode = instr & 0x7F; // instr[6..0]
    switch(opcode) {
        //
        // ... all opcode types ...
        //
        case OPCODE_OP_IMM: {
            uint32_t funct3 = (instr >> 12) & 0x07; // instr[14..12]
            uint32_t rs1 = ((instr >> 15) & 0x1F);
            uint32_t imm = (instr >> 20) & 0x0FFF; // 12 bits
            uint32_t imm_sign = instr & (1ul << 31);
            if(imm_sign) imm |= 0xFFFFF000; // sign-extend the immediate
            uint32_t data1 = state->reg[rs1];
            uint32_t data2 = imm;
            uint32_t reg_dest = (instr >> 7) & 0x1F;
            if(reg_dest != 0) { // register 0 is always zero and never written to
                switch(funct3) {
                    //
                    // ... all OP_IMM instructions ...
                    //
                    case FUNCT3_OP_ADD: {
                        state->reg[reg_dest] = data1 + data2;
                        break;
                    }
                }
            }
            break;
        }
        // ...
    }
    state->pc += 4; // go to next instruction for non-jump / non-branch
    return 0;
}
The mathematical and logical operations are the simplest to implement, so I started with them. Iteratively I added branches, jumps and the remaining logic until I had covered all instructions apart from ECALL and EBREAK. These two weren’t necessary for my bare metal experiment.
For testing I relied on simple hand-written assembly code. Of course this didn’t exercise my emulator fully. So as a next step I wanted to finally run my Conway’s Game of Life simulation.
Going from C to a bare metal CPU takes a few steps: cross compile, ensure a proper memory layout and convert the ELF file to a binary blob. Also, instead of having a main function my code has a _start function defined as follows:
void __attribute__((section (".text.boot"))) _start() {
    run(); // call the actual "entrypoint"
}
I’ll explain the details later.
My CPU only supports the RISC-V 32 bit integer instruction set, but my host system is running on x86-64. So I needed a cross compiler and used the Ubuntu package gcc-riscv64-unknown-elf. Then I could compile my code using the following command:
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 \
    -nostdlib -ffreestanding -Tprograms/memory_map.ld \
    -o life.rv32.elf life.c
Let’s take this apart:
- execute the RISC-V cross-compiler
- set its architecture to rv32i, which is the RISC-V 32 bit integer instruction set
- define the application binary interface, i.e. conventions for how to emit assembly. This makes it so that integers, longs and pointers are 32 bit
  - these three are needed to emit code compatible with my emulator and later the CPU
- compile without a standard library
  - standard libraries like the system libc assume operating system support, but my toy CPU will be running bare metal. So we have to switch that off. This means we won’t have access to malloc, printf, puts and so on. Instead we’ll have to implement these ourselves, if we want to use them. It also means we have no startup code.
- compile freestanding, that is, don’t assume the presence of an operating system or library, and switch off libc-specific optimizations and defaults
  - for example we won’t have a main function, which is otherwise required
- use a memory map
  - we need to tell the compiler and linker where instructions and global variables will be placed in memory. We do not have a loader to do this at application startup
Even though we don’t yet have any hardware, we must make a few decisions for item 6: what should our address space look like?
- program execution starts at 0x1000, which below I’ll call rom, for read-only memory
- memory for globals, stack variables and heap will be located at 0x10000000
These values are kind of arbitrary. I wanted to avoid having code at address zero to avoid issues with NULL pointers. The linker script also ensures that our program entry point, the function _start, is placed at 0x1000, so that the emulator will execute that code first. Here’s my linker script for my address space setup:
ENTRY(_start)

MEMORY
{
    rom (rx) : ORIGIN = 0x00001000, LENGTH = 16M
    ram (rw) : ORIGIN = 0x10000000, LENGTH = 32M
}

SECTIONS
{
    .text : {
        /*
            the entry point is expected to be the first function here
            --> we are assuming there is only a single function in the .text.boot segment
                and by convention that is "_start"
            KEEP ensures that "_start" is kept here, even if there are no references to it
        */
        KEEP(*(.text.boot))

        /* all other code follows */
        *(.text*)
    } > rom

    .rodata : { *(.rodata*) } > rom
    .bss : { *(.bss*) } > ram
}
After compilation we can check that _start is actually at 0x1000:
riscv64-unknown-elf-readelf -s life.rv32.elf | grep '_start$'
Now the “problem” is that gcc generates an ELF and not just a stream of instructions. The Executable and Linkable Format is, simplified, a container to store executable code, data and metadata in a way that makes it easy to later load into memory, as specified by a memory map like the one above. Since my program is fairly simple it doesn’t need memory initialization. So we can simply dump the RISC-V instructions from the .text segment into a binary file:
riscv64-unknown-elf-objcopy -O binary life.rv32.elf life.rv32.bin
So now the C snippet from above should make more sense:
void __attribute__((section (".text.boot"))) _start() {
    run();
}
We’re defining a function _start which should go into the segment .text.boot. The linker script instructs the toolchain to make sure this code is placed at 0x1000, even if no other code references it. By having exactly one function in .text.boot this is guaranteed to happen.
Turns out this is still not enough to make the code work. The startup code above doesn’t initialize the stack pointer, i.e. where local variables live in memory. I decided to simplify things and hard-code the initial stack pointer value in my emulator and CPU. This means simply setting register x2, also known as sp, to the end of the memory, here 0x12000000.
A couple of other registers defined in the ABI with special purposes are not used by my program, so I didn’t implement support for them: the global pointer gp and the thread pointer tp.
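The hard-coded value follows from the memory map in the linker script: ram starts at 0x10000000 with length 32M. A quick sketch (the toy register-file layout here is just for illustration):

```python
RAM_ORIGIN = 0x10000000
RAM_LENGTH = 32 * 1024 * 1024      # "32M" in the linker script

# toy register file: x2 is the stack pointer by RISC-V ABI convention
regs = [0] * 32
regs[2] = RAM_ORIGIN + RAM_LENGTH  # sp starts at the end of RAM

assert regs[2] == 0x12000000
```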
When the program is running on my host I rely on the standard library for memory allocation like malloc, or putchar for output. But when running bare metal these functions are not available.
I’ve replaced dynamic memory allocation with static memory assignments. Since my program is the only one running on the CPU, I can use all resources how I see fit. The flag FREESTANDING is set when the program is compiled for my RISC-V emulator / CPU. Without it, the program can run as-is on my host system like any other program.
void run() {
#ifdef FREESTANDING
    map0 = (unsigned char*)0x10080000; // gamestate
    map1 = map0 + 0x80000;             // new gamestate
    leds = map1 + 0x80000;             // output for emulator
#else
    map0 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
    map1 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
#endif
    // ...
}
Instead of relying on putchar for output to the console, my program assumes that the address of the variable leds is memory-mapped to an external LED array. In the case of the emulator, it will simply read this memory area and display it on the console. When running in the simulator (or on an FPGA in the next iteration), the memory controller will set output pins accordingly.
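On the emulator side, rendering that LED area could look roughly like the following sketch (the 80×25 dimensions and the read_mem helper are assumptions for illustration, not the post’s actual code; the base address follows from the statics in run() above):

```python
WIDTH, HEIGHT = 80, 25                 # assumed display dimensions
LEDS_BASE = 0x10080000 + 2 * 0x80000   # leds = map1 + 0x80000, map1 = map0 + 0x80000

def render_leds(read_mem):
    """read_mem(addr, size) -> bytes; returns one console frame as a string."""
    rows = []
    for y in range(HEIGHT):
        row = read_mem(LEDS_BASE + y * WIDTH, WIDTH)
        rows.append("".join("#" if b else "." for b in row))
    return "\n".join(rows)

# toy backing memory with a single lit cell at x=0, y=1
mem = bytearray(WIDTH * HEIGHT)
mem[WIDTH] = 1
frame = render_leds(lambda addr, size: bytes(mem[addr - LEDS_BASE:addr - LEDS_BASE + size]))
assert frame.splitlines()[1][0] == "#"
```

In the real emulator the lambda would be replaced by a call through rv32emu_read.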
Here’s the result of all of that work: first setting a breakpoint for each Game of Life cycle, and then manually stepping through the program on the emulated CPU.
Best viewed in full-screen mode due to the web-unfriendly layout.
With the emulator completed I now have a working reference system to debug my CPU. And so I started implementing it.
A simple CPU consists of the following components:
- Arithmetic Logic Unit (ALU): the compute part, for operations like “add” or “xor”
- Register File: provides and stores register values
- Decoder: transforms an instruction into a set of control signals that drive the CPU operation
- Program Counter: manages the address where the next instruction is found
- Load Store Unit (LSU): connects the CPU to its memory
- Control Unit: ties all the parts together to form a CPU
These are the basic components of a CPU and sufficient for my toy RISC-V implementation.
Hardware is designed with special programming languages, called hardware description languages (HDL). The most common ones are Verilog and VHDL. For my project I decided to use Amaranth HDL, because it’s higher level and easier to use – plus it’s written in my favorite language, Python. Simplified, it enables an engineer to write a Python program that generates a hardware description, instead of directly describing it in Verilog or VHDL. A nice property of Amaranth HDL is that by design the resulting programs are synthesizable, i.e. they can be “compiled” into a description executable on FPGAs or built as an ASIC.
A key difference between software and hardware is concurrency: in software, code is executed line by line, in order, and we need special constructs like threads to achieve parallelism. In hardware it’s different: everything is happening at the same time. We are not describing high-level operations, but rather how logic gates are connected to each other.
Combinational logic
There are two key concepts in hardware: combinational logic (sometimes also called combinatorial logic) and synchronous logic. Simplified, combinational logic executes all the time and all at the same time. In the following example green are input signals, yellow are internal signals (output of logic and input to the next logic), blue is logic and orange is the final output signal:
graph LR
    count_1((signal: count 1)) --> adder1[[logic: Adder]]
    count_2((signal: count 2)) --> adder1
    adder1 --> sum((signal: temp sum))
    sum --> adder2[[logic: Adder]]
    count_3((signal: count 3)) --> adder2
    adder2 --> output((signal: result))

    style count_1 fill:#7e6,stroke:#333,stroke-width:1px
    style count_2 fill:#7e6,stroke:#333,stroke-width:1px
    style count_3 fill:#7e6,stroke:#333,stroke-width:1px
    style adder1 fill:#5ae,stroke:#333,stroke-width:1px
    style adder2 fill:#5ae,stroke:#333,stroke-width:1px
    style sum fill:#ee3,stroke:#333,stroke-width:1px
    style output fill:#e90,stroke:#333,stroke-width:1px
Combinational logic always updates its output immediately when any input changes. There are a couple of physical limitations here, but we’ll simplify this for now. This means changing any input signal will immediately change the output signal sum.
In Amaranth we can implement this as:
# to run tests: python3 -m pytest add3.py
import pytest
from amaranth import Signal, Module, Elaboratable
from amaranth.build import Platform
from amaranth.sim import Simulator, Settle

class Add3Comb(Elaboratable):
    def __init__(self):
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # technically this is not needed: a second `+` below would do,
        # but let's build the circuit exactly as shown above
        temp_sum = Signal(32)

        # define how our logic works
        m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
        m.d.comb += self.result.eq(self.count_3 + temp_sum)

        return m

def test_add3comb():
    # set up our device under test
    dut = Add3Comb()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle, i.e. arrive at a defined state
        yield Settle()

        # check that the sum is the expected value
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.run()
The key takeaway here is that the lines
m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
m.d.comb += self.result.eq(self.count_3 + temp_sum)
are executed at the same time, and also whenever the inputs change.
Synchronous Logic
There’s a second kind of commonly used logic: synchronous logic. The difference to combinational logic is that outputs only change on a clock edge, i.e. when the clock signal goes from low to high (positive edge) or vice versa (negative edge). Let’s use the adder example again. Colors as before, but we’ll use turquoise for synchronous logic.
graph LR
    count_1((signal: count 1)) --> adder1[[logic: Adder]]
    count_2((signal: count 2)) --> adder1
    adder1 --> sum((signal: temp sum))
    sum --> adder2[[logic: Adder]]
    count_3((signal: count 3)) --> adder2
    adder2 --> output((signal: result))
    clock((signal: clock)) --> adder1
    clock --> adder2

    style count_1 fill:#7e6,stroke:#333,stroke-width:1px
    style count_2 fill:#7e6,stroke:#333,stroke-width:1px
    style count_3 fill:#7e6,stroke:#333,stroke-width:1px
    style clock fill:#7e6,stroke:#333,stroke-width:1px
    style adder1 fill:#0ff,stroke:#333,stroke-width:1px
    style adder2 fill:#0ff,stroke:#333,stroke-width:1px
    style sum fill:#ee3,stroke:#333,stroke-width:1px
    style output fill:#e90,stroke:#333,stroke-width:1px
We use positive edge triggered logic here. So unless the clock goes from low to high, both temp sum and result will never change. The following table shows how the values change. Let’s additionally assume the logic was just reset, so the outputs start at 0.

signal / time | t=0 | t=1 | t=2 | t=3 |
---|---|---|---|---|
clock | 0 | 1 | 0 | 1 |
count_1 | 7 | 7 | 7 | 7 |
count_2 | 14 | 14 | 14 | 14 |
count_3 | 21 | 21 | 21 | 21 |
temp_sum | 0 | **21** | 21 | 21 |
result | 0 | **21** | 21 | **42** |

Changes are highlighted in bold. The circuit takes on the expected value only after two full clock cycles. Even if the input signals are not defined in the interval between one positive edge and the next, this will not change the output in any way:
signal / time | t=0 | t=1 | t=2 | t=3 |
---|---|---|---|---|
clock | 0 | 1 | 0 | 1 |
count_1 | 7 | 7 | 0 | 7 |
count_2 | 14 | 14 | 0 | 14 |
temp_sum | 21 | 21 | 21 | 21 |
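The tables above can be reproduced with a toy Python model of the two registers (plain Python, not Amaranth; each call models one positive clock edge, where both registers update from their old inputs simultaneously):

```python
def posedge(state, count_1, count_2, count_3):
    """One positive clock edge: result sees the *old* temp_sum."""
    temp_sum, result = state
    return (count_1 + count_2, count_3 + temp_sum)

state = (0, 0)                     # just after reset
state = posedge(state, 7, 14, 21)  # t=1: result = 21 + old temp_sum (0)
assert state == (21, 21)
state = posedge(state, 7, 14, 21)  # t=3 (t=2 has no positive edge)
assert state == (21, 42)
```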
Physical circuits are more complex (“delta time”) and this results in interesting tradeoffs between the length of combinational logic paths (number of gates, circuit length) and the achievable clock speed. Luckily this doesn’t matter for my toy CPU.
In Amaranth we can implement this as:
# in addition to the imports above, this needs ClockDomain (from amaranth)
# and Tick (from amaranth.sim)
class Add3Sync(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        temp_sum = Signal(32)

        # define how our logic works
        m.d.sync += temp_sum.eq(self.count_1 + self.count_2)
        m.d.sync += self.result.eq(self.count_3 + temp_sum)

        return m

def test_add3sync():
    # set up our device under test
    dut = Add3Sync()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle, i.e. arrive at a defined state
        yield Settle()

        # no positive edge yet, so still at the reset value
        assert (yield dut.result) == 0

        # trigger a positive edge on the clock and wait for things to settle
        yield Tick()
        yield Settle()

        # count_3 is mirrored in the output, since temp_sum is still zero
        assert (yield dut.result) == 21

        yield Tick()
        yield Settle()

        # now both count_3 and temp_sum will be reflected in the output
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.add_clock(1e-6)
    sim.run()
Armed with this knowledge I figured out which things needed to happen in parallel and which in sequence.
So if we have an ALU-related instruction it would work like this:
1. in parallel:
   - read the instruction from ROM at the instruction address,
   - decode the instruction,
   - read the register values and, if present, the immediate value,
   - compute the result in the ALU,
   - assign the ALU result to the destination register (not yet visible!),
   - increment the instruction address by 4 bytes (not yet visible!)
2. wait for the positive clock edge, giving step 1 time to settle, and in the following instant:
   - update the instruction address, making the new value visible,
   - update the destination register value, making the new value visible
3. repeat, starting at 1.
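The cycle above can be sketched as a plain-Python toy model (handling only ADDI, and not the actual Amaranth code): everything before the clock edge is combinational, and only clock_edge makes the new values visible:

```python
def combinational(pc, regs, rom):
    """Step 1: fetch, decode, compute -- nothing is stored yet."""
    instr = rom[(pc - 0x1000) // 4]
    rd = (instr >> 7) & 0x1F
    rs1 = (instr >> 15) & 0x1F
    imm = (instr >> 20) & 0xFFF
    if imm & 0x800:                   # sign-extend the immediate
        imm -= 0x1000
    alu_result = (regs[rs1] + imm) & 0xFFFFFFFF   # assume an ADDI
    return rd, alu_result, pc + 4

def clock_edge(state, rd, alu_result, next_pc):
    """Step 2: on the positive edge the new values become visible."""
    pc, regs = state
    if rd != 0:
        regs[rd] = alu_result
    return next_pc, regs

rom = [0x02A00093]                    # addi x1, x0, 42
state = (0x1000, [0] * 32)
state = clock_edge(state, *combinational(*state, rom))
assert state == (0x1004, [0, 42] + [0] * 30)
```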
Iteratively creating diagrams of how things should work was immensely helpful. Below is a simplified version of my CPU design, though it lacks many of the control signals and special cases for operations related to jumping and branching. Please view it in full screen mode, where you can also toggle the ALU or LSU layers to make it easier to read. Colors here are just to help with the readability of the diagram.
Now let’s talk about the CPU components in detail. Designing them reminded me a lot of functional programming, where the parameters of a function and the types naturally guide the implementation. All necessary details about the RISC-V instruction set are specified in detail in the spec.
ALU
Computes data1 $OPERATION data2. I decided to merge the branch unit into the ALU, so there’s a branch for is_branch.
class ALU(Elaboratable):
    def __init__(self):
        # if set to 0, then normal ALU operation,
        # otherwise treat funct3 as branch condition operator
        self.i_is_branch = Signal(1)

        # operation, e.g. "add" or "xor", from decoder
        self.i_funct3 = Signal(3)
        # sub-operation, e.g. "sub" for "add", from decoder
        self.i_funct7 = Signal(7)

        # value of register 1
        self.i_data1 = SignedSignal(32)
        # value of register 2 or immediate
        self.i_data2 = SignedSignal(32)

        # computation result
        self.o_result = SignedSignal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # this ALU also implements branch logic
        with m.If(self.i_is_branch == 0):
            # normal ALU
            with m.Switch(self.i_funct3):
                with m.Case(FUNCT3_OP_XOR):
                    m.d.comb += self.o_result.eq(self.i_data1 ^ self.i_data2)
                with m.Case(FUNCT3_OP_SLL):
                    shift_amount = self.i_data2[0:5]
                    m.d.comb += self.o_result.eq(
                        self.i_data1.as_unsigned() << shift_amount)
                # ...
In this snippet you can see how Amaranth is really a code generator: instead of using normal if statements, which control Python flow, you have to use the m.Switch, m.If etc. methods on the module.
Decoder
Provides all the signals necessary to control execution.
class InstructionDecoder(Elaboratable):
    def __init__(self):
        self.i_instruction = Signal(32)
        self.i_instruction_address = Signal(32)

        # select signals for the register file
        self.o_rs1 = Signal(5)
        self.o_rs2 = Signal(5)
        self.o_rd = Signal(5)
        self.o_rd_we = Signal(1)

        # ALU / LSU operations
        self.o_funct3 = Signal(3)
        self.o_funct7 = Signal(7)

        # immediate value
        self.o_imm = SignedSignal(32)

        # control signals
        self.o_invalid = Signal(1)
        self.o_has_imm = Signal(1)
        self.o_is_branch = Signal(1)
        self.o_is_memory = Signal(2)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.comb += self.o_invalid.eq(0)
        m.d.comb += self.o_is_branch.eq(0)

        opcode = self.i_instruction[0:7]
        with m.Switch(opcode):
            with m.Case(OPCODE_OP_IMM):
                # rd = rs1 $OP imm
                # use ALU with immediate
                m.d.comb += [
                    self.o_rd.eq(self.i_instruction[7:12]),
                    self.o_rd_we.eq(1),
                    self.o_funct3.eq(self.i_instruction[12:15]),
                    self.o_rs1.eq(self.i_instruction[15:20]),
                    self.o_rs2.eq(0),
                    self.o_imm.eq(self.i_instruction[20:32]),
                    self.o_has_imm.eq(1),
                    self.o_funct7.eq(0),
                ]
            # ...
Register File
Implements the 32 registers. Special case: register x0 is hard-wired to zero, by not allowing writes to it.
I’m not sure if this is the best implementation, but it works well in simulation so far.
class RegisterFile(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.i_select_rs1 = Signal(5)
        self.i_select_rs2 = Signal(5)
        self.i_select_rd = Signal(5)
        self.i_we = Signal(1)
        self.i_data = SignedSignal(32)
        self.o_rs1_value = SignedSignal(32)
        self.o_rs2_value = SignedSignal(32)
        self.registers = Signal(32 * 32)
        self.ports = [self.sync, self.i_select_rs1, self.i_select_rs2, self.i_select_rd, self.i_data, self.i_we]

    def elaborate(self, _: Platform) -> Module:
        """
        on the clock edge, if i_we is set: stores i_data at reg[i_select_rd]
        combinationally returns the register values
        """
        m = Module()

        m.d.comb += [
            self.o_rs1_value.eq(self.registers.word_select(self.i_select_rs1, 32)),
            self.o_rs2_value.eq(self.registers.word_select(self.i_select_rs2, 32)),
        ]

        with m.If((self.i_we == 1) & (self.i_select_rd != 0)):
            m.d.sync += self.registers.word_select(self.i_select_rd, 32).eq(self.i_data)

        return m
Program Counter
The simplest component: we start executing programs at 0x1000 and then go to the next instruction. The decoder computes the offset based on the instruction, to allow both absolute and relative jumps.
class ProgramCounter(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.i_offset = SignedSignal(32)
        self.o_instruction_address = Signal(32, reset=0x1000)
        self.ports = [self.o_instruction_address]

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.sync += self.o_instruction_address.eq(self.o_instruction_address + self.i_offset)

        return m
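The jump offsets are where the decoder does the most bit shuffling. As a plain-Python sketch (not the Amaranth code), here is the scrambled J-type immediate decoding, checked against the instruction 0xFF9FF36F (`jal x6 -8`) from the test program below:

```python
def decode_jal_offset(instr):
    """Reassemble the J-type immediate: imm[20 | 10:1 | 11 | 19:12]."""
    imm = ((instr >> 31) & 0x1) << 20      # imm[20] (sign bit)
    imm |= ((instr >> 21) & 0x3FF) << 1    # imm[10:1]
    imm |= ((instr >> 20) & 0x1) << 11     # imm[11]
    imm |= ((instr >> 12) & 0xFF) << 12    # imm[19:12]
    if imm & (1 << 20):                    # sign-extend to a Python int
        imm -= 1 << 21
    return imm

assert decode_jal_offset(0xFF9FF36F) == -8   # jal x6 -8
```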
Load Store Unit
I’m co-simulating this part, so there is no implementation. Also, simulating even small amounts of memory turned out to be way too slow. I hope to find more time in the future to finish this part of the project, and then also run it with real memory on an FPGA instead of just in simulation.
But let’s at least discuss the most interesting aspect of memory: memory is usually very slow compared to the CPU. So the CPU has to be stalled, i.e. wait, while the memory executes the read or write. In my design I’ve defined an o_done signal. This signal tells the control unit not to advance the program counter until the result is available. Not sure if this is the best approach, but it works for now.
class LoadStoreUnit(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # from decoder; not in the spec, internal control signals
        self.i_lsu_mode = Signal(2)
        # from decoder, memory operation
        self.i_funct3 = Signal(3)

        # address
        self.i_address_base = Signal(32)
        self.i_address_offset = SignedSignal(12)

        # reading / writing
        self.i_data = Signal(32)
        self.o_data = Signal(32)

        # signals completion of the memory operation
        self.o_done = Signal(1)

    def elaborate(self, _: Platform) -> Module:
        m = Module()
        # empty by design: this is co-simulated
        return m
Tying it all together and testing it
The control unit connects all modules, as described in the simplified diagram above, and uses the control logic from the decoder to correctly advance the program counter.
Instead of showing the boring glue code, here’s how I’m testing the CPU via simulation. The assembly program is designed to set registers to certain values, which can be checked afterwards. It doesn’t follow any ABI constraints.
def test_cpu():
    dut = CPU()
    sim = Simulator(dut)

    rom = [
        0x02A00093, # addi x1 x0 42   --> x1 = 42
        0x00100133, # add x2 x0 x1    --> x2 = 42
        0x123451B7, # lui x3 0x12345  --> x3 = 0x12345000
        0x00208463, # beq x1 x2 8     --> skip the next instruction
        0x00700193, # addi x3 x0 7    [skipped]
        0x00424233, # xor x4 x4 x4    --> x4 = 0
        0x00A00293, # addi x5 x0 10   --> x5 = 10
        0x00120213, # addi x4 x4 1    --> x4 = x4 + 1
        0x00520463, # beq x4 x5 8     --> skip the next instruction
        0xFF9FF36F, # jal x6 -8       --> jump up; effectively setting x4 = 10
                    #                     also setting x6 = pc + 4
        0x000013B7, # lui x7 0x1      --> x7 = 0x1000
        0x03438467, # jalr x8 x7 52   --> skip the next instruction
        0x00634333, # xor x6 x6 x6    [skipped]
        0x100004B7, # lui x9 0x10000  --> x9 = 0x1000_0000
        0x0324A503, # lw x10 50 x9    --> x10 = *((int32*)(mem_u8_ptr[x9 + 0x32]))
        0x00000013, # nop
        0,
    ]

    ram = [0 for x in range(128)]
    ram[0x32 + 3], ram[0x32 + 2], ram[0x32 + 1], ram[0x32] = 0xC0, 0xFF, 0xEE, 0x42

    finished = [0]

    def bench():
        assert (yield dut.o_tmp_pc) == 0x1000

        while True:
            instr_addr = yield dut.o_tmp_pc
            print("instr addr: ", hex(instr_addr))
            rom_addr = (instr_addr - 0x1000) // 4
            if rom[rom_addr] == 0:
                finished[0] = 1
                print("bench: finished.")
                break
            print("instr: ", hex(rom[rom_addr]))
            yield dut.i_tmp_instruction.eq(rom[rom_addr])

            yield Settle()
            assert (yield dut.decoder.o_invalid) == False

            yield Tick()
            yield Settle()

        read_reg = lambda x: dut.registers.registers.word_select(x, 32)
        assert (yield read_reg(1)) == 42
        assert (yield read_reg(2)) == 42
        assert (yield read_reg(3)) == 0x12345000
        assert (yield read_reg(5)) == 10
        assert (yield read_reg(4)) == 10
        assert (yield read_reg(6)) == 0x1000 + 4 * rom.index(0xFF9FF36F) + 4
        assert (yield read_reg(7)) == 0x1000
        assert (yield read_reg(8)) == 0x1000 + 4 * rom.index(0x03438467) + 4
        assert (yield read_reg(9)) == 0x1000_0000
        assert (yield read_reg(10)) == 0xC0FFEE42

        yield Passive()
    def memory_cosim():
        lsu = dut.lsu

        was_busy = False
        while not finished[0]:
            lsu_mode = yield lsu.i_lsu_mode
            if lsu_mode == INTERNAL_LSU_MODE_DISABLED:
                was_busy = False
                yield lsu.o_data.eq(0)
                yield lsu.o_done.eq(0)
            elif lsu_mode == INTERNAL_LSU_MODE_LOAD and was_busy is False:
                was_busy = True
                base = yield lsu.i_address_base
                offset = yield lsu.i_address_offset
                addr = base + offset
                funct3 = yield lsu.i_funct3
                print(f"memory read request: addr={hex(addr)}")

                yield Tick() # a read takes a while
                yield Tick()
                yield Tick()

                ram_offset = addr - 0x10000000
                if funct3 == FUNCT3_LOAD_W:
                    value = (ram[ram_offset + 3] << 24) | (ram[ram_offset + 2] << 16) | (ram[ram_offset + 1] << 8) | ram[ram_offset]
                # ...
                yield lsu.o_data.eq(value)
                yield lsu.o_done.eq(1)
            # ...
            yield Tick()

        print("memory_cosim: finished.")
        yield Passive()

    sim.add_clock(1e-6)
    sim.add_process(bench)
    sim.add_process(memory_cosim)
    sim.run()
My development time ran out before I completed the project, so no Game of Life on my toy CPU for now. So what’s missing?
- memory mapped I/O, so that instead of keeping the LEDs in memory, signals / pins of the CPU are used,
- adding support for a few missing read / write operations in the memory controller (read byte, write byte),
- integrating the emulator and simulator, re-using the existing debugger user interface,
- and then probably spending some time on debugging,
- maybe porting the simulator to Verilator or another framework to make it fast enough.
But I thought having a blog post is much better than completing this experiment now. I hope to find time in the future to work on this again, finally run Game of Life on my CPU and actually run it on an FPGA. That would be fun.
But the best part is really: I’ve learned a lot, as you’ve read. Try it yourself. Thanks for reading 🙂
If you want to learn more, I’ve collected some links that helped me below: