Now Reading
Writing a C Compiler, Half 1

Writing a C Compiler, Half 1

2023-06-08 15:08:18

That is the primary submit in a collection on writing your personal C compiler. Listed here are some causes to jot down a compiler:

  1. You’ll find out about summary syntax bushes (ASTs) and the way applications can symbolize and manipulate different applications. Helpful for working with linters, static analyzers, and metaprogramming of all types.
  2. You’ll find out about meeting, calling conventions, and all of the gritty, low-level particulars of how computer systems, like, do stuff.
  3. It looks as if an impossibly arduous challenge (however isn’t!), so writing one will make you are feeling like a badass.

I’ve been working by myself C compiler, nqcc for the previous a number of weeks, utilizing Abdulaziz Ghuloum’s An Incremental Approach to Compiler Construction as a roadmap. I actually like Ghuloum’s strategy: you begin by compiling a tiny, trivial subset of your supply language all the way in which right down to x86 assembly. Then you definately add new language options, one step at a time. In the first step, you simply return constants; in a later step you deal with addition and subtraction; and so forth. Each step is sufficiently small to really feel manageable, and on the finish of the each step you may have a working compiler.

This collection is tailored from Ghuloum’s paper – the unique paper is about compiling Scheme, so I needed to make some changes to compile C as a substitute. I’ll cowl arithmetic operations, conditionals, native variables, operate calls, and maybe extra. I’ve additionally written some test programs that you should use to validate that every stage of your compiler works appropriately.

Earlier than you begin, you could resolve on two issues: what language to jot down your compiler in, and easy methods to deal with parsing and lexing. You’ll be able to implement the compiler in no matter language you want, however I’d advocate utilizing a language with sum sorts and sample matching1, like OCaml, Haskell, or Rust. It will likely be SO MUCH EASIER to construct and traverse an AST in case you do. I began writing nqcc in Python, which I do know very properly, then obtained fed up and switched to OCaml, which I didn’t know properly in any respect, and it was undoubtedly value it.

You additionally must resolve whether or not to jot down your personal parser and lexer or use automated parser and scanner turbines (e.g. flex and bison). On this collection of posts, I’ll present you easy methods to write a lexer (or scanner) and recursive descent parser by hand. Utilizing a parser generator might be simpler, however I haven’t tried it so I may very well be mistaken. You could possibly in all probability additionally use a scanner generator for lexing, however hand-write your personal parser. Mainly, do no matter you want, however I’m solely going to speak about hand-writing a lexer and parser for the remainder of this collection, so if you wish to use bison and flex you’re by yourself.

Replace 2/18/19

There’s another factor you could resolve on: whether or not to focus on 32-bit or 64-bit meeting. This collection makes use of 32-bit structure as a result of that’s what Ghuloum’s paper used. Nevertheless, I’ve realized since beginning the collection that this was a nasty name. As a result of 32-bit structure is more and more out of date, compiling and working 32-bit binaries could be a headache. I’ve determined to return and add 64-bit examples to this collection after I get the possibility. Till I try this, you may have one in every of two choices:

  1. Determine by yourself easy methods to adapt these posts to a 64-bit instruction set. In case you’re in any respect conversant in meeting, this isn’t too arduous and it’s what I’d advocate.
  2. Persist with the 32-bit instruction set I’ve utilized in these posts. This may require just a little additional work up entrance, relying in your OS:

    • On Linux, you’ll want to put in some additional libraries with a purpose to flip your 32-bit meeting into an executable. This Dockerfile lists the libraries you’ll want (plus some Scheme-related stuff you may ignore). Many due to Jaseem Abid, who had beforehand labored via Ghuloum’s paper, for creating this Dockerfile and telling me about it.

    • 32-bit assist is being phased out on macOS, and the subsequent model in all probability gained’t allow you to run 32-bit binaries in any respect. In the mean time, the gcc binary that ships with XCode gained’t compile 32-bit functions by default. You’ll be able to simply set up GCC from Homebrew and use that as a substitute, or you may futz round with XCode and work out easy methods to make it construct 32-bit binaries. I went with the previous.

This week, we’ll compile a program that returns a single integer. We’ll additionally arrange the three fundamental passes of our compiler. This can be a variety of work for not that a lot payoff, however the structure we outline now will make it simple so as to add extra language options afterward.

Right here’s a program we’d wish to compile – we’ll name it return_2.c:

We’ll solely deal with applications with a single operate, major, consisting of a single return assertion. The one factor that varies is the worth of the integer being returned. We gained’t deal with hex or octal integer literals, simply decimal. To confirm that your compiler works appropriately, you’ll must compile a program, run it, and test its return code:

$ ./YOUR_COMPILER return_2.c # compile the supply file proven above
$ ./gcc -m32 return_2.s -o return_2 # assemble it into an executable
$ ./return_2 # run the executable you simply compiled
$ echo $? # test the return code; it ought to be 2
2 

Your compiler will produce x86 meeting. We gained’t remodel the meeting into an executable ourselves – that’s the job of the assembler and linker, that are separate applications2. To see how this program seems to be in meeting, let’s compile it with gcc3:

$ gcc -S -O3 -fno-asynchronous-unwind-tables return_2.c
$ cat return_2.s
    .part __TEXT,__text_startup,common,pure_instructions
    .align 4
    .globl _main
_main:
    movl    $2, %eax
    ret
    .subsections_via_symbols

Now, let’s have a look at the meeting itself. We are able to ignore the .part, .align and .subsections_via_symbols directives – in case you delete them, you may nonetheless assemble and run return_2.s4. .globl _main signifies that the _main image ought to be seen to the linker; in any other case it might probably’t discover the entry level to this system. (In case you’re on a Unix-like system apart from OS X, this image will simply be major, no underscore.)

Lastly, we’ve our precise meeting directions:

_main:                  ; label for begin of "major" operate
    movl    $2, %eax    ; transfer fixed "2" into the EAX register
    ret                 ; return from operate

A very powerful level right here is that when a operate returns, the EAX register5 will comprise its return worth. The major operate’s return worth would be the program’s exit code.

An necessary facet notice: all through this tutorial, I’ll use AT&T meeting syntax, as a result of that’s what GCC makes use of by default. Some on-line sources may use Intel syntax, which has operands within the reverse order from AT&T syntax. Everytime you’re studying meeting, be sure you know what syntax it’s utilizing!

The one factor that may change within the snippet of meeting above is the return worth. So one quite simple strategy can be to make use of a daily expression to extract the return worth from the supply code, then plug it into the meeting. Right here’s a 20-line Python script to do this:

import sys, os, re

#anticipated type of a C program, with out line breaks
source_re = r"int mains*(s*)s*{s*returns+(?P<ret>[0-9]+)s*;s*}" 

# Use 'major' as a substitute of '_main' if not on OS X
assembly_format = """    
    .globl _main
_main:
    movl    ${}, %eax
    ret
"""

source_file = sys.argv[1]
assembly_file = os.path.splitext(source_file)[0] + ".s"

with open(source_file, 'r') as infile, open(assembly_file, 'w') as outfile:
    supply = infile.learn().strip()
    match = re.match(source_re, supply)

    # extract the named "ret" group, containing the return worth
    retval = match.group('ret') 
    outfile.write(assembly_format.format(retval))

However parsing the entire program with one massive common expression isn’t a viable long-term technique. As a substitute, we’ll break up up the compiler into three phases: lexing, parsing, and code technology. So far as I do know, it is a fairly customary compiler structure, besides you’d usually need a bunch of optimization passes between parsing and code technology.

Lexing

The lexer (additionally known as the scanner or tokenizer) is the part of the compiler that breaks up a string (the supply code) into a listing of tokens. A token is the smallest unit the parser can perceive – if a program is sort of a paragraph, tokens are like particular person phrases. (Many tokens are particular person phrases, separated by whitespace.) Variable names, key phrases, and constants, and punctuation like braces are all examples of tokens. Right here’s a listing of all of the tokens in return_2.c:

  • int key phrase
  • Identifier “major”
  • Open parentheses
  • Shut parentheses
  • Open brace
  • return key phrase
  • Fixed “2”
  • Semicolon
  • Shut brace

Notice that some tokens have a worth (e.g. the fixed token has worth “2”) and a few don’t (like parentheses and braces). Additionally notice that there are not any whitespace tokens. (In some languages, like Python, whitespace is important and also you do want tokens to symbolize it.)

Listed here are all of the tokens your lexer wants to acknowledge, and the common expression defining every of them:

  • Open brace {
  • Shut brace }
  • Open parenthesis (
  • Shut parenthesis )
  • Semicolon ;
  • Int key phrase int
  • Return key phrase return
  • Identifier [a-zA-Z]w*
  • Integer literal [0-9]+

If you would like, you may simply have a “key phrase” token sort, as a substitute of a special token sort for every key phrase.

☑ Activity:

Write a lex operate that accepts a file and returns a listing of tokens. It ought to work for all stage 1 examples within the check suite, together with the invalid ones. (The invalid examples ought to elevate errors within the parser, not the lexer.) To maintain issues easy, we solely lex decimal integers. In case you like, you may lengthen your lexer to deal with octal and hex integers too.

You may discover that we will’t lex unfavourable integers. That’s not an accident – C doesn’t have unfavourable integer constants. It simply has a negation operator, which may be utilized to optimistic integers. We’ll add negation within the subsequent submit.

Parsing

The following step is reworking our checklist of tokens into an summary syntax tree. An AST is one strategy to symbolize the construction of a program. In most programming languages, language constructs like conditionals and performance declarations are made up of less complicated constructs, like variables and constants. ASTs seize this relationship; the basis of the AST would be the whole program, and every node can have kids representing its constituent elements. Let’s have a look at a small instance:

if (a < b) {
    c = 2;
    return c;
} else {
    c = 3;
}

This code snippet is an if assertion, so we’ll label the basis of our AST “if assertion”. It would have three kids:

  • The situation (a < b)
  • The if physique (c = 2; return c;)
  • The else physique (c = 3;)

Every of those elements may be damaged down additional. For instance, the situation is a binary < operation with two kids:

  • The primary operand (variable a)
  • The second operand (variable b)

An task assertion (like c=2;) additionally has two kids: the variable being up to date (c), and the expression assigned to it (2).

The if physique, then again, can have an arbitrary variety of kids – every assertion is a toddler node. On this case it has two kids as a result of there are two statements. The youngsters are ordered – c=2; precedes return c; as a result of it comes first within the supply code.

Right here’s the total AST for this code snippet:

Image of diagram; text outline follows

  • if assertion
    • situation: binary operation (<)
      • operand 1: variable a
      • operand 2: variable b
    • if physique: assertion checklist
      • assertion 1: task
        • variable: c
        • right-hand facet: fixed 2
      • assertion 2: return
    • else physique: assertion checklist
      • assertion 1: task
        • variable: c
        • right-hand facet: fixed 3

And right here’s pseudocode for establishing this AST:

//create if situation
cond = BinaryOp(op='>', operand_1=Var(a), operand_2=Var(b))

//create if physique
assign = Project(var=Var(c), rhs=Const(2))
return = Return(val=Var(c))
if_body = [assign, return]

//create else physique
assign_else = Project(var=Var(c), rhs=Const(3))
else_body = [assign_else]

//assemble if assertion
if = If(situation=cond, physique=if_body, else=else_body)

For now, although, we don’t want to fret about conditionals, variable assignments, or binary operators. Proper now, the one AST nodes we have to assist are applications, operate declarations, statements, and expressions. Right here’s how we’ll outline every of them:

program = Program(function_declaration)
function_declaration = Perform(string, assertion) //string is the operate title
assertion = Return(exp)
exp = Fixed(int) 

Proper now, a program consists of a single operate, major. In a while we’ll outline a program as a listing of features. A operate has a reputation and a physique. Later, a operate may also have a listing of arguments. In an actual C compiler, we’d additionally must retailer the operate’s return sort, however proper now we solely have integer sorts. A operate physique is a single assertion; later will probably be a listing of statements. There’s just one sort of assertion: a return assertion. Later we’ll add different kinds of statements, like conditionals and variable declarations. A return assertion has one youngster, an expression – that is the worth being returned. For now an expression can solely be an integer fixed. Later we’ll let expressions embody arithmetic operations, which can permit us to parse statements like return 2+2;.

As we add new language constructs, we’ll replace the definitions of our AST nodes. For instance, we’ll ultimately add a brand new sort of assertion: variable task. Once we do, we’ll add a brand new type to our assertion definition:

assertion = Return(exp) | Assign(variable, exp)

Right here’s a diagram of the AST for return_2.c:

Image of diagram; text outline follows

Lastly, we want a proper grammar, which defines how collection of tokens may be mixed to type language constructs. We’ll outline it right here in Backus-Naur Form:

<program> ::= <operate>
<operate> ::= "int" <id> "(" ")" "{" <assertion> "}"
<assertion> ::= "return" <exp> ";"
<exp> ::= <int>

Every of the traces above is a manufacturing rule, defining how a language assemble may be constructed from different language constructs and tokens. Each image that seems on the left-hand facet of a manufacturing rule (i.e. <program>, <operate>, <assertion>) is a non-terminal image. Particular person tokens (key phrases, ids, punctuation, and so on.) are terminal symbols. Notice that, whereas this grammar tells us what sequence of tokens constitutes a legitimate C program, it doesn’t inform us precisely easy methods to remodel that program into an AST – for instance, there’s no manufacturing rule comparable to the Fixed node within the AST. We might rewrite our grammar to have a manufacturing rule for constants, however we don’t must with a purpose to parse this system.

Proper now the grammar is very simple; there’s just one manufacturing rule for every non-terminal image. Later, some non-terminal symbols can have a number of manufacturing guidelines. For instance, if we added assist for variable declarations, we might have the next rule for deriving statements:

<assertion> ::= "return" <int> ";" | "int" <id> "=" <int> ";"

To remodel a listing of tokens into an AST, we’ll use a way known as recursive descent parsing. We’ll outline a operate to parse every non-terminal image within the grammar and return a corresponding AST node. The operate to parse image S ought to take away tokens from the beginning of the checklist till it reaches a legitimate derivation of S. If, earlier than it’s carried out parsing, it hits a token that isn’t within the manufacturing rule for S, it ought to fail. If the rule for S accommodates different non-terminals, it ought to name different features to parse them.

Right here’s the pseudocode for parsing an announcement:

def parse_statement(tokens):
    tok = tokens.subsequent()
    if tok.sort != "RETURN_KEYWORD":
        fail()
    tok = tokens.subsequent()
    if tok.sort != "INT"
        fail()
    exp = parse_exp(tokens) //parse_exp will pop off extra tokens
    assertion = Return(exp)

    tok = tokens.subsequent()
    if tok.sort != "SEMICOLON":
        fail()

    return assertion

Later, the manufacturing guidelines can be recursive (e.g. an arithmetic expression can comprise different expressions), which implies the parsing features can be recursive too – therefore the title recursive descent parser.

☑ Activity:

Write a parse operate that accepts a listing of tokens and returns an AST, rooted at a Program node. The operate ought to construct the proper AST for all legitimate stage 1 examples, and lift an error on all invalid stage 1 examples. If you would like, you may as well have your parser fail gracefully if it encounters integers above your system’s INT_MAX.

There are a variety of methods to symbolize an AST in code – every sort of node may very well be its personal class or its personal datatype, relying on what language you’re writing your compiler in. For instance, right here’s the way you may outline AST nodes as OCaml datatypes:

sort exp = Const(int)
sort assertion = Return(exp)
sort fun_decl = Enjoyable(string, assertion)
sort prog = Prog(fun_decl)

Code Technology

Now that we’ve constructed an AST, we’re able to generate some meeting! Like we noticed earlier than, we solely must emit 4 traces of meeting. To emit it, we’ll traverse the AST in roughly the order that this system executes. Which means we’ll go to, so as:

  • The operate title (probably not a node, however the very first thing within the operate definition)
  • The return worth
  • The return assertion

Notice that we regularly (although not all the time) traverse the tree in post-order, visiting a toddler earlier than its mum or dad. For instance, we have to generate the return worth earlier than it’s referenced in a return assertion. In later posts, we’ll must generate the operands of arithmetic expressions earlier than producing the code that operates on them.

Right here’s the meeting we want:

  1. To generate a operate (e.g. operate “foo”):
     .globl _foo
    _foo:
     <FUNCTION BODY GOES HERE>
    
  2. To generate a return assertion (e.g. return 3;):

☑ Activity:

Write a generate operate that accepts an AST and generates meeting. It could possibly return the meeting as a string or write it on to a file. It ought to generate appropriate meeting for all legitimate stage 1 examples.

(Optionally available) Fairly printing

You’ll in all probability need a utility operate to print out your AST, to assist with debugging. You’ll be able to write it now, or wait till you want it. Right here’s what nqcc’s fairly printer outputs for return_2.c:

FUN INT major:
    params: ()
    physique:
        RETURN Int<2>

This instance contains some data your AST doesn’t want, just like the return sort and checklist of operate parameters.

☑ Activity:

Write a pretty-print funcion that takes an AST and prints it out in a readable approach.

Placing all of it collectively

☑ Activity:

Write a program that accepts a C supply file and outputs an executable. This system ought to:

  1. Learn within the file
  2. Lex it
  3. Parse it
  4. Generate meeting
  5. Write the meeting to a file
  6. Invoke GCC command to transform the meeting to an executable:
    gcc -m32 meeting.s -o out
    

    On this command, “meeting.s” is the title of the meeting file and “out” is the title of the executable you wish to generate. The -m32 choice tells GCC to construct a 32-bit binary. You’ll be able to omit that choice and construct 64-bit binaries if you would like, however you’ll must make some adjustments to the code technology steps afterward (e.g. utilizing 64-bit registers).

  7. (Optionally available) Delete the meeting file.

Testing

You’ll be able to check that your compiler is working correctly with the check script here. It would compile a set of check applications utilizing your compiler, execute them, and ensure they return the correct worth.

To invoke it:

./test_compiler.sh /path/to/your/compiler

In an effort to check it with the script, your compiler must observe this spec:

  1. It may be invoked from the command line, taking solely a C supply file as an argument, e.g.:
    ./YOUR_COMPILER /path/to/program.c
  2. When handed program.c, it generates executable program in the identical listing.
  3. It doesn’t generate meeting or an executable if parsing fails (that is what the check script checks for invalid check applications).

The script doesn’t test whether or not your compiler outputs wise error messages, however you should use the invalid check applications to check that manually.

Up Subsequent

Within the next post, we’ll add three unary operators: -, ~, and !. Keep tuned!

If in case you have any questions, corrections, or different suggestions, you may email me or open an issue.

Additional Studying

  • Baby Steps to a C Compiler – a submit about one other C compiler impressed by Ghuloum’s paper.
  • The C11 Standard, the present C language specification. Annex A is a abstract of C’s grammar, so it’s a superb reference for parsing. You in all probability don’t must learn this all through.

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top