Let's make a Teeny Tiny compiler, part 1

Austin Z. Henley

I work on AI + developer tools.



5/5/2020

This is the first post in a three-part series. Check out part 2 and part 3 when you are ready.

It is a beautiful day outside, so let's make a compiler. You don't need any knowledge of how compilers work to follow along. We are going to use Python to implement our own programming language, Teeny Tiny, that will compile to C code. It will take about 500 lines of code and provide the initial infrastructure needed to customize and extend the compiler into your own billion dollar production-ready compiler.

This tutorial is a series of posts that go step-by-step in building a working compiler. All of the source code can be found in the GitHub repo. If you follow along with all of the posts, I guesstimate that it will take you only a few hours.

The Teeny Tiny language we are going to implement is a dialect of BASIC. The syntax is clean and simple. If you prefer a C-like syntax, it will be trivial to modify the compiler at the end. Here is an example program in our Teeny Tiny language:

PRINT "What number of fibonacci numbers would you like?"
INPUT nums

LET a = 0
LET b = 1
WHILE nums > 0 REPEAT
    PRINT a
    LET c = a + b
    LET a = b
    LET b = c
    LET nums = nums - 1
ENDWHILE	

This program prints out terms of the fibonacci sequence based on the user's input: 0 1 1 2 3 5 8 13…

Our language will allow a variety of the basic operations that you'd expect from a programming language. Specifically, it will support:

  • Numerical variables
  • Basic arithmetic
  • If statements
  • While loops
  • Print text and numbers
  • Input numbers
  • Labels and goto
  • Comments

Although this is a standard subset of features, you may notice that there are no functions, no arrays, no way to read/write from a file, and not even an else statement. But with just this small set of constructs, you can actually do a lot. It will also set up the compiler in such a way that many other features will be straightforward to add later.

Compiler Overview

Our compiler will follow a three step process. First, given the input source code, it will break the code up into tokens. These are like words and punctuation in English. Second, it will parse the tokens to make sure they are in an order that is allowed in our language, much like English sentences follow specific structures of verbs and nouns. Third, it will emit the C code that our language will translate to.

We will use these three steps as the main organization for our code. The lexer, parser, and emitter will each have their own Python code file. This tutorial is broken up into three parts based on these steps as well. If you were to extend the compiler, there are some additional steps you would add, but we will hold off on discussing those.
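
As a rough preview of that organization, here is a sketch of the eventual pipeline. The Parser and Emitter names are placeholders for the modules built in parts 2 and 3, so treat this as an outline of where we are headed rather than code to type in now:

# lexer = Lexer(source)             # 1. split the source string into tokens
# emitter = Emitter("out.c")        #    collects the C code we generate
# parser = Parser(lexer, emitter)   # 2. check that the tokens follow the grammar
# parser.program()                  # 3. walk the program, emitting equivalent C code
# emitter.writeFile()               #    write the generated C code to a file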

Lexer Overview

The first module of our compiler is called the lexer. Given a string of Teeny Tiny code, it will iterate character by character to do two things: decide where each token starts/stops and what type of token it is. If the lexer is unable to do this, then it will report an error for an invalid token.

As an example of the lexer's input and output (the original post shows this as a figure; a rough stand-in illustration follows below): given the Teeny Tiny code, the lexer must determine where the tokens are along with their type (e.g., keyword). You can see that spaces aren't recognized as tokens, but the lexer will use them as one way to know when a token ends.
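
For instance, a single line of code might be broken up like this (the exact token type names are introduced later in this post):

Input:   IF+-123 foo*THEN/
Tokens:  IF (keyword), + (operator), - (operator), 123 (number),
         foo (identifier), * (operator), THEN (keyword), / (operator)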

Let's finally get into some code, starting with the structure of the lexer in the file lex.py:

class Lexer:
    def __init__(self, source):
        pass

    # Process the next character.
    def nextChar(self):
        pass

    # Return the lookahead character.
    def peek(self):
        pass

    # Invalid token found, print error message and exit.
    def abort(self, message):
        pass

    # Skip whitespace except newlines, which we will use to indicate the end of a statement.
    def skipWhitespace(self):
        pass

    # Skip comments in the code.
    def skipComment(self):
        pass

    # Return the next token.
    def getToken(self):
        pass

I like to sketch out all of the functions that I think I'll need, then go back and fill them in. The function getToken will be the meat of the lexer. It will be called each time the compiler is ready for the next token and it will do the work of classifying tokens. nextChar and peek are helper functions for looking at the next character. skipWhitespace consumes the spaces and tabs that we don't care about. abort is what we will use to report an invalid token.

The lexer needs to keep track of the current position in the input string and the character at that position. We will initialize these in the constructor:

    def __init__(self, source):
        self.source = source + '\n' # Source code to lex as a string. Append a newline to simplify lexing/parsing the last token/statement.
        self.curChar = ''   # Current character in the string.
        self.curPos = -1    # Current position in the string.
        self.nextChar()

The lexer needs the input code, and we append a newline to it (this just simplifies some checks later on). curChar is what the lexer will constantly check to decide what kind of token it is. Why not just do source[curPos]? Because that would litter the code with bounds checking. Instead we do that in nextChar:

    # Process the next character.
    def nextChar(self):
        self.curPos += 1
        if self.curPos >= len(self.source):
            self.curChar = '\0'  # EOF
        else:
            self.curChar = self.source[self.curPos]

This increments the lexer's current position and updates the current character. If we reach the end of the input, set the character to the end-of-file marker. This is the only place we will modify curPos and curChar. But sometimes we want to look ahead to the next character without updating curPos:

    # Return the lookahead character.
    def peek(self):
        if self.curPos + 1 >= len(self.source):
            return '\0'
        return self.source[self.curPos+1]

We should make sure these functions work. Let's test them by creating a new file, teenytiny.py:

from lex import *

def main():
    source = "LET foobar = 123"
    lexer = Lexer(source)

    while lexer.peek() != '\0':
        print(lexer.curChar)
        lexer.nextChar()

main()

Run this and the output should be each character of the input string, LET foobar = 123, on a new line:

L
E 
T 

f 
o 
o 
b 
a 
r 

= 

1 
2 
3 

Classifying Tokens

But we don't just want characters, we want tokens! We need to plan how combining individual characters makes a token, which works a lot like a state machine. Here are the main lexer rules for the Teeny Tiny language (a few concrete examples follow the list):

  • Operator. One or two consecutive characters that matches: + - * / = == != > < >= <=
  • String. Double quotation mark followed by zero or more characters and another double quotation mark. Such as: "hello, world!" and ""
  • Number. One or more numeric characters followed by an optional decimal point and one or more numeric characters. Such as: 15 and 3.14
  • Identifier. An alphabetical character followed by zero or more alphanumeric characters.
  • Keyword. Exact text match of: LABEL, GOTO, PRINT, INPUT, LET, IF, THEN, ENDIF, WHILE, REPEAT, ENDWHILE
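To make those rules concrete, here are a few example strings and how the lexer will classify them (illustrative examples, not an exhaustive list):

>=                 a single operator token (not > followed by =)
"hello, world!"    a string token
3.14               a number token
foobar             an identifier token
WHILE              a keyword token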

Next we will start our getToken function in our Lexer class:

    # Return the next token.
    def getToken(self):
        # Check the first character of this token to see if we can decide what it is.
        # If it is a multiple character operator (e.g., !=), number, identifier, or keyword then we will process the rest.
        if self.curChar == '+':
            pass  # Plus token.
        elif self.curChar == '-':
            pass  # Minus token.
        elif self.curChar == '*':
            pass  # Asterisk token.
        elif self.curChar == '/':
            pass  # Slash token.
        elif self.curChar == '\n':
            pass  # Newline token.
        elif self.curChar == '\0':
            pass  # EOF token.
        else:
            # Unknown token!
            pass

        self.nextChar()

This will detect a few possible tokens, but it doesn't do anything useful yet. What we need next is a Token class to keep track of what type of token it is and the exact text from the code. Place this in lex.py for now:

# Token contains the original text and the type of token.
class Token:
    def __init__(self, tokenText, tokenKind):
        self.text = tokenText   # The token's actual text. Used for identifiers, strings, and numbers.
        self.kind = tokenKind   # The TokenType that this token is classified as.

To specify what type a token is, we will create the TokenType class as an enum. It looks long, but it just specifies every possible token our language allows. Add import enum to the top of lex.py and add this class:

# TokenType is our enum for all the types of tokens.
class TokenType(enum.Enum):
    EOF = -1
    NEWLINE = 0
    NUMBER = 1
    IDENT = 2
    STRING = 3
    # Keywords.
    LABEL = 101
    GOTO = 102
    PRINT = 103
    INPUT = 104
    LET = 105
    IF = 106
    THEN = 107
    ENDIF = 108
    WHILE = 109
    REPEAT = 110
    ENDWHILE = 111
    # Operators.
    EQ = 201
    PLUS = 202
    MINUS = 203
    ASTERISK = 204
    SLASH = 205
    EQEQ = 206
    NOTEQ = 207
    LT = 208
    LTEQ = 209
    GT = 210
    GTEQ = 211
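
Just to illustrate how these two classes fit together, a token pairs the raw text with one of these enum values (this snippet is only for demonstration, not part of the compiler):

token = Token('+', TokenType.PLUS)
print(token.text)   # prints: +
print(token.kind)   # prints: TokenType.PLUS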

Now we can expand getToken to actually do something when it detects a specific token:

    # Return the next token.
    def getToken(self):
        token = None

        # Check the first character of this token to see if we can decide what it is.
        # If it is a multiple character operator (e.g., !=), number, identifier, or keyword then we will process the rest.
        if self.curChar == '+':
            token = Token(self.curChar, TokenType.PLUS)
        elif self.curChar == '-':
            token = Token(self.curChar, TokenType.MINUS)
        elif self.curChar == '*':
            token = Token(self.curChar, TokenType.ASTERISK)
        elif self.curChar == '/':
            token = Token(self.curChar, TokenType.SLASH)
        elif self.curChar == '\n':
            token = Token(self.curChar, TokenType.NEWLINE)
        elif self.curChar == '\0':
            token = Token('', TokenType.EOF)
        else:
            # Unknown token!
            pass

        self.nextChar()
        return token

This code sets up the lexer to detect the basic arithmetic operators along with newlines and the end-of-file marker. The else clause is for capturing everything that isn't allowed.

Let's change main to see whether this works so far:

def main():
    source = "+- */"
    lexer = Lexer(source)

    token = lexer.getToken()
    while token.kind != TokenType.EOF:
        print(token.kind)
        token = lexer.getToken()

If you run this, you should see something like:

TokenType.PLUS
TokenType.MINUS
Traceback (most recent call last):
  File "e:/projects/teenytiny/part1/teenytiny.py", line 12, in <module>
    main()
  File "e:/projects/teenytiny/part1/teenytiny.py", line 8, in main
    while token.kind != TokenType.EOF:
AttributeError: 'NoneType' object has no attribute 'kind'

Uh oh! Something went wrong. The only way getToken returns None is if the else branch is taken. We should handle this a little better. Add import sys to the top of lex.py and define the abort function like:

    # Invalid token found, print error message and exit.
    def abort(self, message):
        sys.exit("Lexing error. " + message)

And replace the else in getToken with:

        else:
            # Unknown token!
            self.abort("Unknown token: " + self.curChar)

Now run the program again…

TokenType.PLUS
TokenType.MINUS
Lexing error. Unknown token: 

There is still a problem, but now we can make a little more sense of it. It looks like something went wrong after the first two tokens. The unknown token is invisible. Looking back at the input string, you may notice we aren't handling whitespace! We need to implement the skipWhitespace function:

    # Skip whitespace except newlines, which we will use to indicate the end of a statement.
    def skipWhitespace(self):
        while self.curChar == ' ' or self.curChar == '\t' or self.curChar == '\r':
            self.nextChar()

Now put self.skipWhitespace() as the first line of getToken. Run the program and you should see the output:

TokenType.PLUS
TokenType.MINUS
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE

Progress!

At this point, we can move on to lexing the operators that are made up of two characters, such as == and >=. All of these operators will be lexed in the same fashion: check the first character, then peek at the second character to see what it is before deciding what to do. Add this after the elif for the SLASH token in getToken:

        elif self.curChar == '=':
            # Check whether this token is = or ==
            if self.peek() == '=':
                lastChar = self.curChar
                self.nextChar()
                token = Token(lastChar + self.curChar, TokenType.EQEQ)
            else:
                token = Token(self.curChar, TokenType.EQ)

Using the peek function allows us to look at what the next character will be without discarding the current curChar. Here is the code for the remaining operators, which work the same way:

        elif self.curChar == '>':
            # Check whether this token is > or >=
            if self.peek() == '=':
                lastChar = self.curChar
                self.nextChar()
                token = Token(lastChar + self.curChar, TokenType.GTEQ)
            else:
                token = Token(self.curChar, TokenType.GT)
        elif self.curChar == '<':
            # Check whether this token is < or <=
            if self.peek() == '=':
                lastChar = self.curChar
                self.nextChar()
                token = Token(lastChar + self.curChar, TokenType.LTEQ)
            else:
                token = Token(self.curChar, TokenType.LT)
        elif self.curChar == '!':
            if self.peek() == '=':
                lastChar = self.curChar
                self.nextChar()
                token = Token(lastChar + self.curChar, TokenType.NOTEQ)
            else:
                self.abort("Expected !=, got !" + self.peek())

The only operator that is a bit different is !=. That's because the ! character isn't valid on its own, so it must be followed by =. The other characters are valid on their own, but the lexer is greedy and will accept them as one of the multi-character operators if possible.

We can test these operators by updating the input to "+- */ >>= = !=" which should give the following output when you run the program:

TokenType.PLUS
TokenType.MINUS
TokenType.ASTERISK
TokenType.SLASH
TokenType.GT
TokenType.GTEQ
TokenType.EQ
TokenType.NOTEQ
TokenType.NEWLINE

The program now accepts all of the language's operators. So what's left? We need to add support for comments, strings, numbers, identifiers, and keywords. Let's work through these one at a time and test as we go.

The # character will indicate the start of a comment. Whenever the lexer sees it, we know to ignore all of the text after it until a newline. Comments are not tokens, but the lexer will discard all of this text so that it can find the next thing we care about. It is also important that we don't throw away the newline at the end of the comment, since that is its own token and may still be needed. Fill in skipComment:

    # Skip comments in the code.
    def skipComment(self):
        if self.curChar == '#':
            while self.curChar != '\n':
                self.nextChar()

Easy enough! Now call it from getToken, such that the first few lines of the function look like:

    # Return the next token.
    def getToken(self):
        self.skipWhitespace()
        self.skipComment()
        token = None
        ...

Test it out with the input "+- # This is a comment!\n */" and you should see:

TokenType.PLUS
TokenType.MINUS
TokenType.NEWLINE
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE

Notice that the comment is completely ignored!

Our language supports printing a string, which starts with a double quotation mark and continues until another quotation mark. We won't allow some special characters in order to make it easier to compile to C later on. Add the following code to getToken's big block of else-if statements:

        elif self.curChar == '"':
            # Get characters between quotations.
            self.nextChar()
            startPos = self.curPos

            whereas self.curChar != '"':
                # Do not permit particular characters within the string. No escape characters, newlines, tabs, or %.
                # We will likely be utilizing C's printf on this string.
                if self.curChar == 'r' or self.curChar == 'n' or self.curChar == 't' or self.curChar == '' or self.curChar == '%':
                    self.abort("Unlawful character in string.")
                self.nextChar()

            tokText = self.supply[startPos : self.curPos] # Get the substring.
            token = Token(tokText, TokenType.STRING)

You'll see the code is just a while loop that continues until the second quotation mark. It will abort with an error message if any of the invalid characters are found. Something different from the other tokens we have covered so far: we set the token's text to the content of the string (minus the quotation marks).
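
As an aside, if you also want to see the text that each token captured, you could temporarily change the loop in main to print both fields (the outputs shown in this post assume the original version that prints only the kind):

    while token.kind != TokenType.EOF:
        print(token.kind, token.text)
        token = lexer.getToken()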

Update the input again with "+- \"This is a string\" # This is a comment!\n */" and run the program:

TokenType.PLUS
TokenType.MINUS
TokenType.STRING
TokenType.NEWLINE
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE

Moving right along to numbers. Our language defines a number as one or more digits (0-9) followed by an optional decimal point that must be followed by at least one digit. So 48 and 3.14 are allowed, but .9 and 1. are not. We will use the peek function again to look ahead one character. Similar to the string token, we keep track of the starting and ending points of the number so that we can set the token's text to the actual number.

        elif self.curChar.isdigit():
            # Leading character is a digit, so this must be a number.
            # Get all consecutive digits and decimal if there is one.
            startPos = self.curPos
            while self.peek().isdigit():
                self.nextChar()
            if self.peek() == '.': # Decimal!
                self.nextChar()

                # Must have at least one digit after decimal.
                if not self.peek().isdigit():
                    # Error!
                    self.abort("Illegal character in number.")
                while self.peek().isdigit():
                    self.nextChar()

            tokText = self.source[startPos : self.curPos + 1] # Get the substring.
            token = Token(tokText, TokenType.NUMBER)

Test it out with the input "+-123 9.8654*/" and you should see:

TokenType.PLUS
TokenType.MINUS
TokenType.NUMBER
TokenType.NUMBER
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE

Great, we are almost done with the lexer!

The last big thing is to handle identifiers and keywords. The rule for an identifier is anything that starts with an alphabetic character followed by zero or more alphanumeric characters. But before we call it a TokenType.IDENT, we have to make sure it isn't one of our keywords. Add this to getToken:

        elif self.curChar.isalpha():
            # Leading character is a letter, so this must be an identifier or a keyword.
            # Get all consecutive alphanumeric characters.
            startPos = self.curPos
            while self.peek().isalnum():
                self.nextChar()

            # Check if the token is in the list of keywords.
            tokText = self.source[startPos : self.curPos + 1] # Get the substring.
            keyword = Token.checkIfKeyword(tokText)
            if keyword == None: # Identifier
                token = Token(tokText, TokenType.IDENT)
            else:   # Keyword
                token = Token(tokText, keyword)

Pretty much like the other tokens. But we need to define checkIfKeyword in the Token class:

    @staticmethod
    def checkIfKeyword(tokenText):
        for kind in TokenType:
            # Relies on all keyword enum values being 1XX.
            if kind.name == tokenText and kind.value >= 100 and kind.value < 200:
                return kind
        return None

This just checks whether the token is in the list of keywords, which we have arbitrarily set to having 101-199 as their enum values.
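
For example, given the enum values above, checkIfKeyword behaves like this in a Python shell:

>>> Token.checkIfKeyword("WHILE")
<TokenType.WHILE: 109>
>>> Token.checkIfKeyword("foobar") is None
True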

Alright, test identifiers and keywords with the input string "IF+-123 foo*THEN/":

TokenType.IF
TokenType.PLUS
TokenType.MINUS
TokenType.NUMBER
TokenType.IDENT
TokenType.ASTERISK
TokenType.THEN
TokenType.SLASH
TokenType.NEWLINE


There we have it. Our lexer can correctly identify every token that our language needs! We have successfully completed the first module of our compiler.

If you think this is underwhelming, don't give up yet! I think the lexer is actually the most tedious yet least interesting part of compilers. Next up we will parse the code, that is, make sure the tokens are in an order that makes sense, and then we will emit code.

The source code so far can be found in its entirety in the GitHub repo.


Continue on to part 2 of this tutorial.

