Let's make a Teeny Tiny compiler, part 1

Austin Z. Henley
I work on AI + dev tools.
5/5/2020
This is the first post in a three-part series. Check out part 2 and part 3 when you are ready.
It's a beautiful day outside, so let's make a compiler. You don't need any knowledge of how compilers work to follow along. We're going to use Python to implement our own programming language, Teeny Tiny, which will compile to C code. It will take about 500 lines of code and provide the initial infrastructure needed to customize and extend the compiler into your own billion-dollar production-ready compiler.
This tutorial is a series of posts that build a working compiler step by step. All of the source code can be found in the GitHub repo. If you follow along with all of the posts, I guesstimate that it will take you only a few hours.
The Teeny Tiny language we're going to implement is a dialect of BASIC. The syntax is clean and simple. If you prefer a C-like syntax, it will be trivial to modify the compiler at the end. Here is an example program in our Teeny Tiny language:
PRINT "What number of fibonacci numbers would you like?" INPUT nums LET a = 0 LET b = 1 WHILE nums > 0 REPEAT PRINT a LET c = a + b LET a = b LET b = c LET nums = nums - 1 ENDWHILE
This program prints out terms of the fibonacci sequence based on the user's input: 0 1 1 2 3 5 8 13…
Our language will allow a variety of the basic operations that you'd expect from a programming language. Specifically, it will support:
- Numerical variables
- Basic arithmetic
- If statements
- While loops
- Printing text and numbers
- Inputting numbers
- Labels and goto
- Comments
Although this is a fairly standard subset of features, you may notice that there are no functions, no arrays, no way to read/write from a file, and not even an else statement. But with just this small set of constructs, you can actually do a lot. It will also set up the compiler in such a way that many other features will be straightforward to add later.
Compiler Overview
Our compiler will follow a three-step process that is illustrated above. First, given the input source code, it will break the code up into tokens. These are like words and punctuation in English. Second, it will parse the tokens to make sure they are in an order that is allowed in our language, just like English sentences follow specific structures of verbs and nouns. Third, it will emit the C code that our language translates to.
We will use these three steps as the main organization for our code. The lexer, parser, and emitter will each have their own Python file. This tutorial is broken up into three parts based on these steps as well. If you were to extend the compiler, there are additional steps you'd add, but we will hold off on discussing those.
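To give a sense of where this is heading, here is a rough sketch (not final code) of how the driver script could eventually wire the three stages together. Only the lexer exists so far; the Parser and Emitter names and the output file name are placeholders for what parts 2 and 3 will build.

# A rough sketch of how the driver might eventually tie the three modules
# together. Only the Lexer exists so far; Parser and Emitter are built in
# parts 2 and 3, so treat those names as placeholders.
from lex import Lexer

def main():
    source = 'PRINT "hello, world!"'   # Teeny Tiny input to compile.

    lexer = Lexer(source)
    # parser = Parser(lexer)       # Part 2: checks that the tokens form valid statements.
    # emitter = Emitter("out.c")   # Part 3: writes the equivalent C code.
    # parser.program()
    # emitter.writeFile()

main()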
Lexer Overview
The first module of our compiler is called the lexer. Given a string of Teeny Tiny code, it will iterate character by character to do two things: decide where each token starts/stops and what type of token it is. If the lexer is unable to do this, then it will report an error for an invalid token.
The figure demonstrates an example input and output of the lexer. Given the Teeny Tiny code, the lexer must determine where the tokens are along with their type (e.g., keyword). You can see that spaces aren't recognized as tokens, but the lexer will use them as one way to know when a token ends.
Let's finally get into some code, starting with the structure of the lexer in the file lex.py:
class Lexer:
    def __init__(self, source):
        pass

    # Process the next character.
    def nextChar(self):
        pass

    # Return the lookahead character.
    def peek(self):
        pass

    # Invalid token found, print error message and exit.
    def abort(self, message):
        pass

    # Skip whitespace except newlines, which we will use to indicate the end of a statement.
    def skipWhitespace(self):
        pass

    # Skip comments in the code.
    def skipComment(self):
        pass

    # Return the next token.
    def getToken(self):
        pass
I like to sketch out all of the functions that I think I'll need, then go back and fill them in. The function getToken will be the meat of the lexer. It will be called each time the compiler is ready for the next token and it will do the work of classifying tokens. nextChar and peek are helper functions for looking at the next character. skipWhitespace consumes the spaces and tabs that we don't care about. abort is what we will use to report an invalid token.
The lexer needs to keep track of the current position in the input string and the character at that position. We will initialize these in the constructor:
def __init__(self, source):
    self.source = source + '\n' # Source code to lex as a string. Append a newline to simplify lexing/parsing the last token/statement.
    self.curChar = ''   # Current character in the string.
    self.curPos = -1    # Current position in the string.
    self.nextChar()
The lexer takes the input code and we append a newline to it (this just simplifies some checks later on). curChar is what the lexer will constantly check to decide what kind of token it is. Why not just do source[curPos]? Because that would litter the code with bounds checking. Instead we do that in nextChar:
# Process the next character.
def nextChar(self):
    self.curPos += 1
    if self.curPos >= len(self.source):
        self.curChar = '\0'  # EOF
    else:
        self.curChar = self.source[self.curPos]
This increments the lexer's current position and updates the current character. If we reach the end of the input, it sets the character to the end-of-file marker. This is the only place we will modify curPos and curChar. But sometimes we want to look ahead to the next character without updating curPos:
# Return the lookahead character.
def peek(self):
    if self.curPos + 1 >= len(self.source):
        return '\0'
    return self.source[self.curPos+1]
We should make sure these functions work. Let's test them by creating a new file, teenytiny.py:
from lex import *

def main():
    source = "LET foobar = 123"
    lexer = Lexer(source)

    while lexer.peek() != '\0':
        print(lexer.curChar)
        lexer.nextChar()

main()
Run this and the output should be each character of the input string, LET foobar = 123, on a new line:
L
E
T

f
o
o
b
a
r

=

1
2
3
Classifying Tokens
But we don't just want characters, we want tokens! We need to plan how combining individual characters together makes a token, which works much like a state machine. Here are the main lexer rules for the Teeny Tiny language (see the short example after this list):
- Operator. One or two consecutive characters that match: + - * / = == != > < >= <=
- String. A double quotation mark followed by zero or more characters and another double quotation mark. Such as: "hello, world!" and ""
- Number. One or more numeric characters followed by an optional decimal point and one or more numeric characters. Such as: 15 and 3.14
- Identifier. An alphabetical character followed by zero or more alphanumeric characters.
- Keyword. Exact text match of: LABEL, GOTO, PRINT, INPUT, LET, IF, THEN, ENDIF, WHILE, REPEAT, ENDWHILE
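As a quick illustration of these rules (the names below are just for this sketch; the real token kinds come from the TokenType enum we define later), a single line of Teeny Tiny breaks apart like this:

# Illustration only: how the rules above would split one line into tokens.
line = 'IF foo >= 100 THEN'
expected_tokens = [
    ("IF",   "keyword"),
    ("foo",  "identifier"),
    (">=",   "operator"),
    ("100",  "number"),
    ("THEN", "keyword"),
]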
Next we will start our getToken function in our Lexer class:
# Return the next token.
def getToken(self):
    # Check the first character of this token to see if we can decide what it is.
    # If it is a multiple character operator (e.g., !=), number, identifier, or keyword then we will process the rest.
    if self.curChar == '+':
        pass # Plus token.
    elif self.curChar == '-':
        pass # Minus token.
    elif self.curChar == '*':
        pass # Asterisk token.
    elif self.curChar == '/':
        pass # Slash token.
    elif self.curChar == '\n':
        pass # Newline token.
    elif self.curChar == '\0':
        pass # EOF token.
    else:
        # Unknown token!
        pass

    self.nextChar()
This will detect a few possible tokens, but it doesn't do anything useful yet. What we need next is a Token class to keep track of what type of token it is and the exact text from the code. Place this in lex.py for now:
# Token contains the original text and the type of token.
class Token:
    def __init__(self, tokenText, tokenKind):
        self.text = tokenText   # The token's actual text. Used for identifiers, strings, and numbers.
        self.kind = tokenKind   # The TokenType that this token is classified as.
To specify what type a token is, we will create the TokenType class as an enum. It looks long, but it just specifies every possible token our language allows. Add import enum to the top of lex.py and add this class:
# TokenType is our enum for all the types of tokens.
class TokenType(enum.Enum):
    EOF = -1
    NEWLINE = 0
    NUMBER = 1
    IDENT = 2
    STRING = 3
    # Keywords.
    LABEL = 101
    GOTO = 102
    PRINT = 103
    INPUT = 104
    LET = 105
    IF = 106
    THEN = 107
    ENDIF = 108
    WHILE = 109
    REPEAT = 110
    ENDWHILE = 111
    # Operators.
    EQ = 201
    PLUS = 202
    MINUS = 203
    ASTERISK = 204
    SLASH = 205
    EQEQ = 206
    NOTEQ = 207
    LT = 208
    LTEQ = 209
    GT = 210
    GTEQ = 211
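As a quick sanity check (a throwaway snippet, not part of the compiler), a Token simply pairs the raw text with one of these TokenType values:

# Throwaway check: a Token pairs the raw text with its TokenType.
token = Token("+", TokenType.PLUS)
print(token.text)                    # +
print(token.kind)                    # TokenType.PLUS
print(token.kind == TokenType.PLUS)  # True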
Now we can expand getToken to actually do something when it detects a specific token:
# Return the next token.
def getToken(self):
    token = None

    # Check the first character of this token to see if we can decide what it is.
    # If it is a multiple character operator (e.g., !=), number, identifier, or keyword then we will process the rest.
    if self.curChar == '+':
        token = Token(self.curChar, TokenType.PLUS)
    elif self.curChar == '-':
        token = Token(self.curChar, TokenType.MINUS)
    elif self.curChar == '*':
        token = Token(self.curChar, TokenType.ASTERISK)
    elif self.curChar == '/':
        token = Token(self.curChar, TokenType.SLASH)
    elif self.curChar == '\n':
        token = Token(self.curChar, TokenType.NEWLINE)
    elif self.curChar == '\0':
        token = Token('', TokenType.EOF)
    else:
        # Unknown token!
        pass

    self.nextChar()
    return token
This code sets up the lexer to detect the basic arithmetic operators along with newlines and the end-of-file marker. The else clause is for catching everything that isn't allowed.
Let's change main to see whether this works so far:
def main():
    source = "+- */"
    lexer = Lexer(source)

    token = lexer.getToken()
    while token.kind != TokenType.EOF:
        print(token.kind)
        token = lexer.getToken()
If you run this, you should see something like:
TokenType.PLUS
TokenType.MINUS
Traceback (most recent call last):
  File "e:/projects/teenytiny/part1/teenytiny.py", line 12, in <module>
    main()
  File "e:/projects/teenytiny/part1/teenytiny.py", line 8, in main
    while token.kind != TokenType.EOF:
AttributeError: 'NoneType' object has no attribute 'kind'
Uh oh! Something went wrong. The only way getToken returns None is if the else branch is taken. We should handle this a little better. Add import sys to the top of lex.py and define the abort function like so:
# Invalid token found, print error message and exit.
def abort(self, message):
    sys.exit("Lexing error. " + message)
And replace the else in getToken with:
else:
    # Unknown token!
    self.abort("Unknown token: " + self.curChar)
Now run the program again…
TokenType.PLUS
TokenType.MINUS
Lexing error. Unknown token:  
There is still a problem, but now we can make a little more sense of it. It looks like something went wrong after the first two tokens. The unknown token is invisible. Looking back at the input string, you may notice that we aren't handling whitespace! We need to implement the skipWhitespace function:
# Skip whitespace except newlines, which we will use to indicate the end of a statement.
def skipWhitespace(self):
    while self.curChar == ' ' or self.curChar == '\t' or self.curChar == '\r':
        self.nextChar()
Now put self.skipWhitespace() as the first line of getToken. Run the program and you should see the output:
TokenType.PLUS
TokenType.MINUS
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE
Progress!
At this point, we can move on to lexing the operators that are made up of two characters, such as == and >=. All of these operators will be lexed in the same fashion: check the first character, then peek at the second character to see what it is before deciding what to do. Add this after the elif for the SLASH token in getToken:
elif self.curChar == '=':
    # Check whether this token is = or ==
    if self.peek() == '=':
        lastChar = self.curChar
        self.nextChar()
        token = Token(lastChar + self.curChar, TokenType.EQEQ)
    else:
        token = Token(self.curChar, TokenType.EQ)
Using the peek function allows us to look at what the next character will be without discarding curChar. Here is the code for the remaining operators, which work the same way:
elif self.curChar == '>':
    # Check whether this token is > or >=
    if self.peek() == '=':
        lastChar = self.curChar
        self.nextChar()
        token = Token(lastChar + self.curChar, TokenType.GTEQ)
    else:
        token = Token(self.curChar, TokenType.GT)
elif self.curChar == '<':
    # Check whether this token is < or <=
    if self.peek() == '=':
        lastChar = self.curChar
        self.nextChar()
        token = Token(lastChar + self.curChar, TokenType.LTEQ)
    else:
        token = Token(self.curChar, TokenType.LT)
elif self.curChar == '!':
    if self.peek() == '=':
        lastChar = self.curChar
        self.nextChar()
        token = Token(lastChar + self.curChar, TokenType.NOTEQ)
    else:
        self.abort("Expected !=, got !" + self.peek())
The one operator that is a bit different is !=. That's because the ! character isn't valid on its own, so it must be followed by =. The other characters are valid on their own, but the lexer is greedy and will accept them as one of the multi-character operators whenever possible.
We can test these operators by updating the input to "+- */ >>= = !=", which should give you the following output when you run the program:
TokenType.PLUS
TokenType.MINUS
TokenType.ASTERISK
TokenType.SLASH
TokenType.GT
TokenType.GTEQ
TokenType.EQ
TokenType.NOTEQ
TokenType.NEWLINE
The program now accepts all of the language's operators. So what's left? We need to add support for comments, strings, numbers, identifiers, and keywords. Let's work through these one at a time and test as we go.
The # character will indicate the start of a comment. Whenever the lexer sees it, we know to ignore all of the text after it until a newline. Comments are not tokens, but the lexer will discard all of this text so that it can find the next thing we care about. It is also important that we don't throw away the newline at the end of the comment, since that is its own token and may still be needed. Fill in skipComment:
# Skip comments in the code.
def skipComment(self):
    if self.curChar == '#':
        while self.curChar != '\n':
            self.nextChar()
Easy enough! Now call it from getToken, such that the first few lines of the function look like:
# Return the next token.
def getToken(self):
    self.skipWhitespace()
    self.skipComment()
    token = None
    ...
Test it out with the input "+- # This is a comment!\n */" and you should see:
TokenType.PLUS
TokenType.MINUS
TokenType.NEWLINE
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE
Notice that the comment is completely ignored!
Our language supports printing a string, which starts with a double quotation mark and continues until another quotation mark. We won't allow some special characters, to make it easier to compile to C later on. Add the following code to getToken's big block of else if statements:
elif self.curChar == '"': # Get characters between quotations. self.nextChar() startPos = self.curPos whereas self.curChar != '"': # Do not permit particular characters within the string. No escape characters, newlines, tabs, or %. # We will likely be utilizing C's printf on this string. if self.curChar == 'r' or self.curChar == 'n' or self.curChar == 't' or self.curChar == '' or self.curChar == '%': self.abort("Unlawful character in string.") self.nextChar() tokText = self.supply[startPos : self.curPos] # Get the substring. token = Token(tokText, TokenType.STRING)
You'll see the code is just a while loop that continues until the second quotation mark. It will abort with an error message if any of the invalid characters are found. Something different from the other tokens we have covered so far: we set the token's text to the contents of the string (minus the quotation marks).
Update the input again to "+- \"This is a string\" # This is a comment!\n */" and run the program:
TokenType.PLUS
TokenType.MINUS
TokenType.STRING
TokenType.NEWLINE
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE
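If you also print token.text (a quick throwaway check, not part of the compiler), you can see that the STRING token carries the contents without the quote marks:

# Throwaway check: the STRING token's text holds the contents minus the quotes.
lexer = Lexer("\"hello, world\"")
token = lexer.getToken()
print(token.kind, repr(token.text))   # TokenType.STRING 'hello, world'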
Moving right along to numbers. Our language defines a number as one or more digits (0-9) followed by an optional decimal point that must be followed by at least one digit. So 48 and 3.14 are allowed, but .9 and 1. are not. We will use the peek function again to look ahead one character. Like the string token, we keep track of the start and end points of the number so that we can set the token's text to the actual number.
elif self.curChar.isdigit():
    # Leading character is a digit, so this must be a number.
    # Get all consecutive digits and decimal if there is one.
    startPos = self.curPos
    while self.peek().isdigit():
        self.nextChar()
    if self.peek() == '.': # Decimal!
        self.nextChar()

        # Must have at least one digit after decimal.
        if not self.peek().isdigit():
            # Error!
            self.abort("Illegal character in number.")
        while self.peek().isdigit():
            self.nextChar()

    tokText = self.source[startPos : self.curPos + 1] # Get the substring.
    token = Token(tokText, TokenType.NUMBER)
Test it out with the input "+-123 9.8654*/" and you should see:
TokenType.PLUS
TokenType.MINUS
TokenType.NUMBER
TokenType.NUMBER
TokenType.ASTERISK
TokenType.SLASH
TokenType.NEWLINE
Great, we're almost done with the lexer!
The last big thing is to handle identifiers and keywords. The rule for an identifier is anything that starts with an alphabetic character followed by zero or more alphanumeric characters. But before we call it a TokenType.IDENT, we have to make sure it isn't one of our keywords. Add this to getToken:
elif self.curChar.isalpha():
    # Leading character is a letter, so this must be an identifier or a keyword.
    # Get all consecutive alphanumeric characters.
    startPos = self.curPos
    while self.peek().isalnum():
        self.nextChar()

    # Check if the token is in the list of keywords.
    tokText = self.source[startPos : self.curPos + 1] # Get the substring.
    keyword = Token.checkIfKeyword(tokText)
    if keyword == None: # Identifier
        token = Token(tokText, TokenType.IDENT)
    else:   # Keyword
        token = Token(tokText, keyword)
Pretty similar to the other tokens. But we need to define checkIfKeyword in the Token class:
@staticmethod
def checkIfKeyword(tokenText):
    for kind in TokenType:
        # Relies on all keyword enum values being 1XX.
        if kind.name == tokenText and kind.value >= 100 and kind.value < 200:
            return kind
    return None
This just checks whether the token is in the list of keywords, which we have arbitrarily set to having 101-199 as their enum values.
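A quick sanity check of how this behaves (throwaway lines, not part of the compiler):

# Throwaway checks: only names whose enum value is in the 100s count as keywords.
print(Token.checkIfKeyword("WHILE"))   # TokenType.WHILE
print(Token.checkIfKeyword("foobar"))  # None, so the lexer treats it as an identifier
print(Token.checkIfKeyword("PLUS"))    # None, since PLUS is 202 and outside the keyword range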
Alright, test identifiers and keywords with the input string "IF+-123 foo*THEN/":
TokenType.IF
TokenType.PLUS
TokenType.MINUS
TokenType.NUMBER
TokenType.IDENT
TokenType.ASTERISK
TokenType.THEN
TokenType.SLASH
TokenType.NEWLINE
And here is what the output looks like from the terminal:
There we have it. Our lexer can correctly identify every token that our language needs! We have successfully completed the first module of our compiler.
If you think this is underwhelming, don't give up yet! I think the lexer is actually the most tedious yet least interesting part of compilers. Next up we will parse the code, that is, make sure the tokens are in an order that makes sense, and then we will emit code.
The source code so far can be found in its entirety in the GitHub repo.
Continue on to part 2 of this tutorial.