Bare metal C on my RISC-V toy CPU · Florian Noeding’s blog

2023-01-26 10:21:06

I always wanted to understand how a CPU works, how it transitions from one instruction to the next and makes a computer work. So after reading Ken Shirriff’s blog about a bug fix in the 8086 processor I thought: Well, let’s try to write one in a hardware description language. This post is a write-up of my learning experiment.

I’ll walk through my steps of creating an emulator, compiling and linking C for bare metal, CPU design and finally the implementation of my toy RISC-V CPU.

  • implement a CPU in a hardware description language (HDL),
  • code must be synthesizable (except memory),
  • simulate it,
  • and run a bare metal C program on it.

While I had plenty of research time, I had only about 30 hours of development time. Without prior hardware design experience the goals had to be simple enough:

  • RISC-V integer instruction set only (minus system and break calls),
  • no interrupt handling or other complexities,
  • no or only minimal optimizations.

My test program written in C is Conway’s game of life; try the interactive web version.

In a future iteration I plan to run my project on a real FPGA and implement a memory controller. Let’s say I made it to about 85% of what I wanted to achieve.

Writing an emulator, i.e. a program that can execute the instructions of a CPU, is an excellent stepping stone towards a hardware implementation. For me, without a hardware background, it’s much easier to reason about and a good way to learn the instruction set.

So my first step was to understand the RISC-V instruction set. The RISC-V specification is quite long, but I only needed chapters 2 (integer instruction set), 19 (RV32 / 64G instruction set listings) and 20 (assembly programmer’s handbook). These give detailed definitions of how each instruction must be executed, what kind of registers must be implemented, and so on.

Let’s look at an example: arithmetic / logical instructions operating on a register and an immediate value. For ADDI (add immediate) the following is done: rd <- rs1 + imm. The register identified by rd is set to the sum of the value stored in register rs1 and the value given in the instruction itself.

Integer Register-Immediate Instructions
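
To make rd <- rs1 + imm concrete, here is a minimal sketch of decoding an I-type instruction word and applying ADDI to a register array. The names ITypeFields, decode_itype and exec_addi are made up for illustration and are not part of the emulator; the bit positions follow the RV32I base encoding.

```cpp
#include <cassert>
#include <cstdint>

// Fields of an I-type instruction such as ADDI (RV32I base encoding).
struct ITypeFields {
    uint32_t opcode;  // instr[6:0]
    uint32_t rd;      // instr[11:7]
    uint32_t funct3;  // instr[14:12]
    uint32_t rs1;     // instr[19:15]
    int32_t  imm;     // instr[31:20], sign-extended to 32 bits
};

// Hypothetical helper: split a 32 bit instruction word into its I-type fields.
inline ITypeFields decode_itype(uint32_t instr) {
    ITypeFields f;
    f.opcode = instr & 0x7F;
    f.rd     = (instr >> 7) & 0x1F;
    f.funct3 = (instr >> 12) & 0x07;
    f.rs1    = (instr >> 15) & 0x1F;
    uint32_t imm = instr >> 20;           // upper 12 bits
    if (imm & 0x800) imm |= 0xFFFFF000u;  // sign-extend bit 11
    f.imm = static_cast<int32_t>(imm);
    return f;
}

// Hypothetical helper: rd <- rs1 + imm. Writes to x0 are discarded.
inline void exec_addi(uint32_t reg[32], const ITypeFields &f) {
    assert(f.opcode == 0x13 && f.funct3 == 0);  // OP-IMM opcode, ADDI funct3
    if (f.rd != 0) {
        reg[f.rd] = reg[f.rs1] + static_cast<uint32_t>(f.imm);
    }
}
```

The addition is done on unsigned values, so a negative immediate wraps around as expected for two's complement arithmetic.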

The emulator is implemented in C++ with a C-style interface in a shared library. This makes it easy to hook it into Python via cffi. This saved me quite a bit of time for file system and user interactions, which were all done in Python.

struct RV32EmuState;

extern "C" RV32EmuState* rv32emu_init();
extern "C" void rv32emu_free(RV32EmuState *state);

extern "C" void rv32emu_set_rom(RV32EmuState *state, void *data, uint32_t size);
extern "C" void rv32emu_read(RV32EmuState *state, uint32_t addr, uint32_t size, void *p);

extern "C" int32_t rv32emu_step(RV32EmuState *state);
extern "C" int32_t rv32emu_run(RV32EmuState *state, uint32_t *breakpoints, uint32_t num_breakpoints);

extern "C" void rv32emu_print(RV32EmuState *state);

The core of the emulator is the rv32emu_step function, which executes exactly one instruction and then returns. In RISC-V each instruction is exactly 32 bits. The function first decodes the opcode (what kind of operation) and then the specific operation (e.g. ADD immediate). It’s a large, nested switch.

int32_t rv32emu_step(RV32EmuState *state) {
    uint32_t *instr_p = (uint32_t*)(_get_pointer(state, state->pc));
    uint32_t instr = *instr_p;

    // decode
    uint32_t opcode = instr & 0x7F; // instr[6..0]

    switch(opcode) {
        //
        // ... all opcode types ...
        //
        case OPCODE_OP_IMM: {
            uint32_t funct3 = (instr >> 12) & 0x07; // instr[14..12]
            uint32_t rs1 = ((instr >> 15) & 0x1F);  // instr[19..15]

            uint32_t imm = (instr >> 20) & 0x0FFF; // 12 bits
            uint32_t imm_sign = instr & (1ul << 31);
            if(imm_sign) {
                imm |= 0xFFFFF000; // sign-extend the 12 bit immediate to 32 bits
            }

            uint32_t data1 = state->reg[rs1];
            uint32_t data2 = imm;
            uint32_t reg_dest = (instr >> 7) & 0x1F; // instr[11..7]
            if(reg_dest != 0) { // register 0 is always zero and never written to
                switch(funct3) {
                    //
                    // ... all OP_IMM instructions ...
                    //
                    case FUNCT3_OP_ADD: {
                        state->reg[reg_dest] = data1 + data2;
                        break;
                    }
                }
            }
            break;
        }

        // ... remaining opcode types ...
    }

    state->pc += 4; // go to next instruction for non-jump / non-branch
    return 0;
}

The mathematical and logical operations are easiest to implement, so I started with them. Iteratively I’ve added branches, jumps and the remaining logic until I had covered all instructions other than ECALL and EBREAK. These two were not necessary for my bare metal experiment.
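
Branches are a bit more fiddly than arithmetic, mostly because the branch offset is scattered across the instruction word. The following is a minimal sketch of how BEQ could be handled, not the emulator's actual code; decode_btype_imm and next_pc_beq are hypothetical names, and the bit layout is the B-type format from the RV32I base instruction set.

```cpp
#include <cassert>
#include <cstdint>

// Reassemble the 13 bit, sign-extended B-type branch offset.
// Layout: imm[12] = instr[31], imm[10:5] = instr[30:25],
//         imm[4:1] = instr[11:8], imm[11] = instr[7]; imm[0] is always 0.
inline int32_t decode_btype_imm(uint32_t instr) {
    uint32_t imm = 0;
    imm |= ((instr >> 31) & 0x1)  << 12;
    imm |= ((instr >> 25) & 0x3F) << 5;
    imm |= ((instr >> 8)  & 0xF)  << 1;
    imm |= ((instr >> 7)  & 0x1)  << 11;
    if (imm & 0x1000) imm |= 0xFFFFE000u;  // sign-extend bit 12
    return static_cast<int32_t>(imm);
}

// Hypothetical helper: compute the next pc for a BEQ instruction.
// A taken branch is relative to the address of the branch itself.
inline uint32_t next_pc_beq(uint32_t pc, uint32_t instr, const uint32_t reg[32]) {
    uint32_t rs1 = (instr >> 15) & 0x1F;
    uint32_t rs2 = (instr >> 20) & 0x1F;
    if (reg[rs1] == reg[rs2]) {
        return pc + static_cast<uint32_t>(decode_btype_imm(instr));
    }
    return pc + 4;  // not taken: fall through to the next instruction
}
```

In a step function like the one above, a taken branch has to skip the generic pc += 4 at the end, otherwise the target would be off by one instruction.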

For testing I relied on simple hand-written assembly code. Of course this didn’t exercise my emulator thoroughly. So as a next step I wanted to finally run my Conway’s game of life simulation.
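
For small tests like these it can help to build instruction words by hand. As an illustration only (encode_addi is a hypothetical helper, not something from my test setup), this sketch assembles "addi rd, rs1, imm" according to the I-type layout:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical test helper: build the 32 bit word for "addi rd, rs1, imm".
// I-type layout: imm[11:0] | rs1 | funct3 (000) | rd | opcode (0010011).
inline uint32_t encode_addi(uint32_t rd, uint32_t rs1, int32_t imm) {
    assert(rd < 32 && rs1 < 32);
    assert(imm >= -2048 && imm <= 2047);  // must fit into 12 signed bits
    uint32_t imm12 = static_cast<uint32_t>(imm) & 0xFFF;
    return (imm12 << 20) | (rs1 << 15) | (0x0u << 12) | (rd << 7) | 0x13u;
}
```

For example, encode_addi(1, 0, 5) yields 0x00500093, the word for "addi x1, x0, 5".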

Going from C to a bare metal CPU takes a few steps: cross compile, ensure a proper memory layout and convert the ELF file to a binary blob. Also, instead of having a main function my code has a _start function defined as follows:

void __attribute__((section (".text.boot"))) _start() {
    run(); // name the precise "entrypoint"
}

I’ll explain the details later.

My CPU only supports the RISC-V 32 bit integer instruction set, but my host system is running on x86-64. So I needed a cross compiler and used the Ubuntu package gcc-riscv64-unknown-elf. Then I could compile my code using the following command:

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 \
    -nostdlib -ffreestanding -Tprograms/memory_map.ld \
    -o life.rv32.elf life.c

Let’s take this apart:

  1. execute the RISC-V cross-compiler
  2. set its architecture to rv32i, which is the RISC-V 32 bit integer instruction set
  3. define the application binary interface, i.e. the conventions for how to emit assembly. This makes integers, longs and pointers 32 bit
    • these three are needed to emit code compatible with my emulator and later CPU
  4. compile without a standard library
    • Standard libraries like the system libc assume operating system support, but my toy CPU will be running bare metal. So we have to switch that off. This means we won’t have access to malloc, printf, puts etc. Instead we’ll need to implement these ourselves, if we want to use them. This means we have no startup code either.
  5. compile freestanding, that is, do not assume the presence of an operating system or library and switch off libc specific optimizations and defaults
    • for example we won’t have a main function, which is otherwise required
  6. use a memory map
    • we need to tell the compiler and linker where instructions and global variables will be placed in memory. We do not have a loader to do this at application startup

Even though we don’t yet have any hardware, we must make a few decisions for item 6: what should our address space look like?

  • program execution starts at 0x1000, which below I’ll call rom for read-only memory
  • memory for globals, stack variables and heap will be located at 0x10000000

These values are kind of arbitrary. I wanted to avoid having code at address zero to avoid issues with NULL pointers. The linker script also ensures that our program entry point, the function _start, is placed at 0x1000, so that the emulator will execute that code first. Here’s my linker script for my address space setup:

ENTRY(_start)

MEMORY
{
    rom (rx ): ORIGIN = 0x00001000, LENGTH = 16M
    ram (rw): ORIGIN = 0x10000000, LENGTH = 32M
}

SECTIONS
{
    .text : {
        /*
            entry point is expected to be the first function here
            --> we are assuming there's only a single function in the .text.boot section and by convention that is "_start"

            KEEP ensures that "_start" is kept here, even if there are no references to it
        */
        KEEP(*(.text.boot))

        /*
            all other code follows
        */
        *(.text*)
    } > rom
    } > rom

    .rodata : { *(.rodata*) } > rom

    .bss : { *(.bss*) } > ram
}
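
On the emulator side the same memory map has to be mirrored when translating an address into a buffer offset. The following is a sketch under assumptions: the constants are taken from the linker script above, while Region, Mapping and map_address are hypothetical names and not the emulator's actual _get_pointer implementation.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the MEMORY block of the linker script.
constexpr uint32_t ROM_ORIGIN = 0x00001000u;
constexpr uint32_t ROM_LENGTH = 16u * 1024u * 1024u;
constexpr uint32_t RAM_ORIGIN = 0x10000000u;
constexpr uint32_t RAM_LENGTH = 32u * 1024u * 1024u;

enum class Region { None, Rom, Ram };

struct Mapping {
    Region   region;  // which memory the address falls into
    uint32_t offset;  // byte offset from the start of that memory
};

// Hypothetical helper: classify an address and compute its offset.
// Addresses outside both regions map to Region::None.
inline Mapping map_address(uint32_t addr) {
    if (addr >= ROM_ORIGIN && addr - ROM_ORIGIN < ROM_LENGTH) {
        return {Region::Rom, addr - ROM_ORIGIN};
    }
    if (addr >= RAM_ORIGIN && addr - RAM_ORIGIN < RAM_LENGTH) {
        return {Region::Ram, addr - RAM_ORIGIN};
    }
    return {Region::None, 0};
}
```

With this layout address 0 is not mapped at all, so a NULL pointer dereference can be detected instead of silently reading code.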

After compilation we can check that _start is actually at 0x1000:

riscv64-unknown-elf-readelf -s life.rv32.elf | grep '_start$'

Now the “problem” is that gcc generates an ELF file and not just a stream of instructions. The Executable and Linkable Format is, simplified, a container to store executable code, data and metadata in a way that makes it easy to later load into memory, as specified by a memory map like the one above. Since my program is fairly simple, it doesn’t need memory initialization. So we can simply dump the RISC-V instructions from the .text segment into a binary file.

riscv64-unknown-elf-objcopy -O binary life.rv32.elf life.rv32.bin
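
The resulting file is just the raw bytes as they should appear in rom. As a sketch of what that means for the emulator (fetch_instruction is a hypothetical helper, and the image is assumed to be loaded at the rom origin), an instruction fetch boils down to reading four bytes in little-endian order:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: read one 32 bit instruction from a flat binary image
// that is assumed to be loaded at rom_origin. RISC-V stores instructions
// in little-endian byte order.
inline uint32_t fetch_instruction(const std::vector<uint8_t> &image,
                                  uint32_t rom_origin, uint32_t pc) {
    assert(pc >= rom_origin);
    std::size_t off = pc - rom_origin;
    assert(off + 4 <= image.size());
    return  static_cast<uint32_t>(image[off])
         | (static_cast<uint32_t>(image[off + 1]) << 8)
         | (static_cast<uint32_t>(image[off + 2]) << 16)
         | (static_cast<uint32_t>(image[off + 3]) << 24);
}
```

Assembling the word byte by byte avoids depending on the host's endianness or alignment rules.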

So now the C snippet from above should make more sense:

void __attribute__((section (".text.boot"))) _start() {
    run();
}

We’re defining a function _start which should go into the section .text.boot. The linker script instructs the toolchain to make sure this code is placed at 0x1000, even when no other code references it. By having exactly one function in .text.boot this is guaranteed to happen.

Turns out this is still not enough to make the code work. The startup code above doesn’t initialize the stack pointer, i.e. where local variables live in memory. I decided to simplify things and hard-code the initial stack pointer value in my emulator and CPU. This means simply setting register x2, also known as sp, to the end of the memory, here 0x12000000.
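
As a sketch of what such a hard-coded reset could look like in an emulator (ResetState and make_reset_state are hypothetical names, and the real emulator state surely holds more than this), the values follow directly from the linker script: the entry point at 0x1000 and the end of the 32M ram region.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal CPU state right after reset.
struct ResetState {
    uint32_t pc;
    uint32_t reg[32];
};

constexpr uint32_t ENTRY_POINT = 0x00001000u;                       // start of rom
constexpr uint32_t RAM_END = 0x10000000u + 32u * 1024u * 1024u;     // 0x12000000

// All registers start at zero, except sp (x2), which points to the end of ram.
inline ResetState make_reset_state() {
    ResetState s{};        // zero-initialize pc and all registers
    s.pc = ENTRY_POINT;
    s.reg[2] = RAM_END;    // x2 is the stack pointer by ABI convention
    return s;
}
```

Pointing sp one past the last ram byte works because the stack grows downwards in the RISC-V calling convention: the first stack frame is allocated below that address.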

A couple of other registers defined in the ABI with a special purpose are not used by my program, so I didn’t implement support: global pointer gp and thread pointer tp.

When the program is running on my host I rely on the standard library for memory allocation like malloc or putchar for output. But when running bare metal these functions are not available.

I’ve replaced dynamic memory allocation with static memory assignments. Since my program is the only one running on the CPU, I can use all resources how I see fit. The flag FREESTANDING is set when the program is compiled for my RISC-V emulator / CPU. Without it, the program can run as-is on my host system like any other program.

void run() {
    #ifdef FREESTANDING
        map0 = (unsigned char*)0x10080000;                              // gamestate
        map1 = map0 + 0x80000;                                          // new gamestate
        leds = map1 + 0x80000;                                          // output for emulator
    #else
        map0 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
        map1 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
    #endif

    // ...
}
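
With fixed addresses it is easy to make the buffers overlap or run out of ram without noticing. A small sanity check can be sketched as follows; the addresses are the ones from the FREESTANDING branch above, while in_ram and buffers_fit are hypothetical helpers and the one-byte-per-cell layout is an assumption based on the malloc sizes in the hosted branch.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t RAM_ORIGIN = 0x10000000u;
constexpr uint32_t RAM_LENGTH = 32u * 1024u * 1024u;

constexpr uint32_t BUFFER_SIZE = 0x80000u;               // spacing between buffers
constexpr uint32_t MAP0_ADDR   = 0x10080000u;            // game state
constexpr uint32_t MAP1_ADDR   = MAP0_ADDR + BUFFER_SIZE; // new game state
constexpr uint32_t LEDS_ADDR   = MAP1_ADDR + BUFFER_SIZE; // output for emulator

// True if [addr, addr + size) lies completely inside ram.
constexpr bool in_ram(uint32_t addr, uint32_t size) {
    return addr >= RAM_ORIGIN && size <= RAM_LENGTH &&
           addr - RAM_ORIGIN <= RAM_LENGTH - size;
}

// True if a (width + 2) x (height + 2) map with one byte per cell
// fits into one buffer slot without spilling into the next one.
inline bool buffers_fit(uint32_t width, uint32_t height) {
    return (width + 2) * (height + 2) <= BUFFER_SIZE;
}
```

The stack starts at the end of ram and grows downwards, so the three buffers near the start of ram stay well clear of it for a small program like this.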

Instead of relying on putchar for output to the console, my program assumes that the address of the variable leds is memory-mapped to an external LED array. In case of the emulator, it will simply read this memory area and display it on the console. When running in the simulator (or FPGA in the next iteration), the memory controller will set output pins accordingly.
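
The emulator side of this can be sketched in a few lines. This is an illustration under assumptions, not my actual display code: render_leds is a hypothetical function, and the layout (one byte per LED, row by row, non-zero meaning "on") is assumed for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: turn a memory-mapped LED area into console text.
// Assumed layout: one byte per LED, row-major, non-zero means "on".
inline std::string render_leds(const uint8_t *leds, int width, int height) {
    assert(leds != nullptr && width > 0 && height > 0);
    std::string out;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out += leds[y * width + x] ? '#' : '.';
        }
        out += '\n';
    }
    return out;
}
```

The bytes would be copied out of emulated ram first, for example with rv32emu_read, and then passed to a function like this.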

Here’s the result of all of that work: first setting a breakpoint for each game of life cycle, and then manually stepping through the program on the emulated CPU.

Best viewed in full-screen mode due to the web-unfriendly layout.

With the emulator completed I now have a working reference system to debug my CPU. And so I started implementing it.

A simple CPU consists of the following components:

  • Arithmetic Logic Unit (ALU): the compute part, for operations like “add” or “xor”
  • Register File: provides and stores register values
  • Decoder: transforms an instruction into a set of control signals, controlling the CPU operation
  • Program Counter: manages the address where the next instruction is found
  • Load Store Unit (LSU): connects the CPU to its memory
  • Control Unit: ties all the parts together to form a CPU

These are the basic components of a CPU and sufficient for my toy RISC-V implementation.
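
Two of these components are simple enough to model in a few lines of software, which helps to pin down their behavior before describing them in hardware. This is a behavioral sketch only; AluOp, alu and RegisterFile are hypothetical names and not taken from my HDL code.

```cpp
#include <cassert>
#include <cstdint>

// A handful of ALU operations; a full RV32I ALU needs a few more
// (shifts, comparisons).
enum class AluOp { Add, Sub, Xor, Or, And };

// The ALU is a pure function of its inputs: no state, no side effects.
inline uint32_t alu(AluOp op, uint32_t a, uint32_t b) {
    switch (op) {
        case AluOp::Add: return a + b;
        case AluOp::Sub: return a - b;
        case AluOp::Xor: return a ^ b;
        case AluOp::Or:  return a | b;
        case AluOp::And: return a & b;
    }
    return 0;  // unreachable for valid AluOp values
}

// Register file with 32 registers; x0 always reads as zero and ignores writes.
struct RegisterFile {
    uint32_t regs[32] = {};

    uint32_t read(uint32_t idx) const {
        assert(idx < 32);
        return idx == 0 ? 0 : regs[idx];
    }

    void write(uint32_t idx, uint32_t value) {
        assert(idx < 32);
        if (idx != 0) regs[idx] = value;
    }
};
```

The decoder and control unit then decide which AluOp to select and which register indices to feed in for a given instruction.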

Hardware is designed with special programming languages, called hardware description languages (HDL). The most common ones are Verilog and VHDL. For my project I decided to use Amaranth HDL, because it’s higher level and easier to use, plus it’s written in my favorite language Python. Simplified, it enables an engineer to describe a program in Python that generates a hardware description, instead of describing it directly in Verilog or VHDL. A nice property of Amaranth HDL is that by design the resulting programs are synthesizable, i.e. they can be “compiled” into a description executable in FPGAs or built as an ASIC.

A key difference between software and hardware is concurrency: in software, code is executed line by line, in order, and we need special constructs like threads to achieve parallelism. In hardware it’s different: everything is happening at the same time. We are not describing high-level operations, but rather how logic gates are connected to each other.
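
A classic way to see the difference is swapping two registers. The sketch below models it in software with hypothetical names (Regs, sequential_swap, clocked_swap): the first function runs two assignments one after the other, the second computes all next values from the current state and commits them together, which is roughly what happens at a clock edge.

```cpp
#include <cassert>
#include <cstdint>

struct Regs {
    uint32_t a;
    uint32_t b;
};

// Software semantics: statements run in order, so the old value of a is
// already overwritten when b is assigned. Both end up with the old b.
inline Regs sequential_swap(Regs r) {
    r.a = r.b;
    r.b = r.a;
    return r;
}

// Hardware-like semantics: every next value is derived from the current
// state, then all registers update at once. The values really swap.
inline Regs clocked_swap(Regs current) {
    Regs next;
    next.a = current.b;
    next.b = current.a;
    return next;
}
```

Starting from a = 1 and b = 2, the sequential version ends with both set to 2, while the clocked version ends with a = 2 and b = 1, without needing a temporary variable.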

Combinational logic